Forum: An entertaining read about BWA, Picard, GATK
2
gravatar for apozhitkov
9 months ago by
apozhitkov50
apozhitkov50 wrote:

Objective The objective is to implement a variant discovery and annotation pipeline similar to one published by Valle et al., (BMC Bioinformatics 2016, 17, 341) and compare the results with those produced by NextGene. The NextGene results were experimentally verified with known variants as well as a DNA dilution series to assess allele frequency quantification (Misyura et al., J of Mol. Diag., 2016, 18, 842). Also, test two versions of the human genome, i.e., hg19 from UCSC and the hg19 from 1000 Genomes. The required Galaxy tools are Fastqgroomer, BWA-MEM, Picard, GATK 3.0, Annovar.

“The Process”

  1. Starting with a fresh install of Galaxy (February 2017).
  2. Created a new database in Postgres and let Galaxy use it.
  3. Obtain human_g1k_v37.fasta.gz from NCBI and upload to Galaxy.
  4. Install data_manager_fetch_genome_dbkeys_all_fasta, which is needed to populate all_fasta data table within Galaxy, which is in turn needed for the other tools. Success
  5. Run data_manager_fetch_genome_dbkeys_all_fasta with human_g1k_v37.fasta.gz Success
  6. Run data_manager_fetch_genome_dbkeys_all_fasta and tell it to fetch hg19 from UCSC FTP. The ftp wants a “dbkey”, hg19 is provided (as a guess). Success
  7. Install fastqgroomer, bwa, data_manager_bwa_index_builder, data_manager_bwa_mem_index_builder. Success :-)
  8. Installing GATK 3.0. Fail : ‘gdata’ not available in ‘gplot’ in the dependency package_r_for_gatk_3_4_0. What?? Uninstall GATK, reinstall, fail
  9. Digging in: according to Galaxy, the faulty package is in /…/galaxy17/shed_tools/toolshed.g2.bx.psu.edu/repos/avowinkel/package_r_for_gatk_3_4_0/49c62e9b71ad/package_r_for_gatk_3_4_0 , there is a file tool_dependencies.xml
  10. Edit: vi tool_dependencies.xml; found lots of links to github to download various R packages.
  11. Searching Google about R, gplot and gdata. People noticed circular dependency of these two packages, no solution suggested.
  12. OK, when in doubt, cut it out. Removed the gplot from the list of package dependencies.
  13. Reinstall package_r_for_gatk_3_4_0, success.
  14. Now, need to build bwa-mem indexes for the bwa-mem to run. Running data_manager_bwa_mem_index_builder with hg19g100. Fail Something in “subprocess.py” –can’t open file
  15. Digging into the subprocess.py and the files calling it. ‘samtools” is hardcoded, not in PATH.
  16. Found where samtools is, found an env.sh file there.
  17. Added to the .bash_profile a call: source env.sh
  18. Restart Galaxy
  19. Running data_manager_bwa_mem_index_builder with hg19g100. Success
  20. Running data_manager_bwa_mem_index_builder with hg19UCSC. Fail “Cannot allocate 0 bytes”. O-oh
  21. Inspecting hg19UCSC fasta file. Completely corrupt, does not even look like fasta!
  22. Go to UCSC, attempt to find the hg19. Confusing stuff all around. Finally, there is a ling to FTP.
  23. Got to FTP, the hg19 is a tar.gz of individual chromosomes and unassembled regions. Where is a single file? Nowhere to be found.
  24. Download the “chromFa.tar.gz”. tar -xfz chromFa.tar.gz
  25. Concatenate all chromosomes into one file: find . -type f -name '*.fa' -exec cat {} + >> hg19ucsc.fa
  26. Go to /…/galaxy17/galaxy/tool-data/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/86fa71e9b427/all_fasta.loc determine where the original ucsc h19 was, remove the file, remove the mention of it from the .loc file
  27. Rerun data_manager_fetch_genome_dbkeys_all_fasta and data_manager_bwa_mem_index_builder with the newly uploaded hg19ucsc.fa Success
  28. OK, let’s try fastq files from real patients. Run Fastqgroomer. Success
  29. Run bwa-mem. Fail - can’t see indexes.
  30. Hmmm. Let’s try data_manager_bwa_index_builder on both genomes. Success
  31. Run bwa-mem. Success Strange, I thought bwa-mem indexes were supposed to be used. Oh well…
  32. Run SortSam (Picard). Success
  33. Run MarkDuplicates (Picard). Success
  34. Run IndelRealignment (GATK). Fail - can’t see reference genome.
  35. Install and run data_manager_gatk_picard_index_builder. Success
  36. Run IndelRealignment (GATK). Fail - can’t see reference genome. WHY???
  37. Googling, a Biostar forum expert gives a nonsensical advice about the very same problem someone else already encountered.
  38. Dig into /…/galaxy17/shed_tools/toolshed.g2.bx.psu.edu/repos/avowinkel/gatk/b80ff7f43ad1/gatk/gatk.xml
  39. The following tags are discovered: <param name="ref_file" type="select" label="Using reference genome" help="-R,‑‑reference_sequence &amp;lt;reference_sequence&amp;gt;"> <options from_data_table="picard_indexes"/>
  40. Sooo, it is pickard_indexes we need, not the ones produced by data_manager_gatk_picard_index_builder
  41. Install data_manager_picard_index_builder for both hg19 genomes. Success
  42. Run data_manager_picard_index_builder Fail “Uncaught exception in exposed API”
  43. Galaxy restart, run data_manager_picard_index_builder again. Success
  44. Run IndelRealignment. Sees reference genomes. Fail Needs RealignTargetCreator to be run first
  45. Run RealignTargetCreator Can’t find GenomeAnalysisTK.jar
  46. Back to the admin screen of Galaxy, manage installed tools, GATK, lots of stuff, there is a second GATK button, click
  47. See the command: java -Xmx\${GATK_MEM:-\${SLURM_MEM_PER_NODE:-4096}}M -jar "\$GATK_PATH/GenomeAnalysisTK.jar" \${THREAD_STRING:-}
  48. GATK_PATH is shown as a button in Galaxy admin screen, which points to ./database/dependencies/environment_settings/GATK_PATH/avowinkel/gatk/b80ff7f43ad1/env.sh
  49. Within the env.sh, there is a totally wrong path to nowhere GATK_PATH=/mnt/…; export GATK_PATH
  50. Search for GenomeAnalysisTK.jar. Found in ./galaxy17/galaxy/database/dependencies/gatk/1.4/devteam/package_gatk_1_4/ec95ec570854/jars/GenomeAnalysisTK.jar
  51. Hmmm, sounds like a previous version. Don’t remember installing that, but OK. Fixed ./database/dependencies/environment_settings/GATK_PATH/avowinkel/gatk/b80ff7f43ad1/env.sh
  52. Run RealignTargetCreator. Fail Crash-boom! Exceptions everywhere…
  53. Go to Broad institute and download GenomeAnalysisTK.jar
  54. Place it into /home/…/gatk_broad
  55. Adjust ./database/dependencies/environment_settings/GATK_PATH/avowinkel/gatk/b80ff7f43ad1/env.sh
  56. Run RealignTargetCreator. Fail Can’t find some special index file .fai. That .fai exists in the directory referred to by the data_manager_gatk_picard_index_builder !
  57. OK, let’s tell gatk to look into the gatk_picard_indexes table
  58. Edit: vi /…/galaxy17/shed_tools/toolshed.g2.bx.psu.edu/repos/avowinkel/gatk/b80ff7f43ad1/gatk/gatk.xml
  59. Run RealignTargetCreator. Fail Found the .fai file, now .dict file isn’t found. Well, there is a .dict, but with a wrong name. Renamed to the expected name.
  60. Run RealignTargetCreator. Fail “SAM/BAM/CRAM file input.bam is malformed. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1317for more information. Error details: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups” Lovely… At least something is running.
  61. Run AddOrReplaceReadGroups (this could have been done earlier, but ok for the test now)
  62. Run RealignTargetCreator. Fail “The contig order in reads and reference is not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files.. reads contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9…]; reference contigs = [MT, 1, 2, 3, 4, 5, 6, 7, 8, 9…]”
  63. Need to run ReorderSam (Picard), which takes picard_index as reference. Not the gatk_picard_index, which makes it inconsistent. In fact /…/galaxy17/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/picard/6741a8ace658/picard/ picard_ReorderSam.xml shows that the table for indexes is picard_indexes.
  64. Run RealignTargetCreator. Same Fail
  65. Replaced picard_index with gatk_picard_index.
  66. Rerun ReorderSam. Success
  67. Run RealignTargetCreator. Success
  68. Run IndelRealigner (GATK). Success
  69. Run HaplotypeCaller (GATK). Still running
ADD COMMENTlink modified 5 months ago by linying91140 • written 9 months ago by apozhitkov50
2
gravatar for Guy Reeves
9 months ago by
Guy Reeves1.0k
Germany
Guy Reeves1.0k wrote:

Hi Alex We have two galaxy instances one conventional and one with docker. When I have had problems with installing tools on our main conventional one I mostly find that the galaxy docker instance has no problem. We would use only docker but we have issues with updating a number of libraries which docker requires. Docker will only take a short time to deploy I would give it a go, you may even find a flavour with most of the tools you want. https://github.com/bgruening/docker-galaxy-stable Cheers Guy

ADD COMMENTlink written 9 months ago by Guy Reeves1.0k

Trying to use GATK on Galaxy is not a path to happiness, the conditions placed on GATK use do not fit will the open source approach of galaxy and so it has not been well ssupportedfor a longtime .

ADD REPLYlink written 9 months ago by Guy Reeves1.0k

Hi Guy, Thanks for the hints! I haven't heard of the dockers before... As for the GATK, indeed, the Broad institute makes you sign something that you are at a non-profit or at a university. It applies to me, so I got their stuff... Alex.

ADD REPLYlink written 9 months ago by apozhitkov50
1
gravatar for Martin Čech
9 months ago by
Martin Čech ♦♦ 4.2k
United States
Martin Čech ♦♦ 4.2k wrote:

Hi apozhitkov,

which parts were the most difficult for you to figure out? Do you have any suggestions for us?

Thanks, M.

ADD COMMENTlink written 9 months ago by Martin Čech ♦♦ 4.2k

Hi Martin, The most difficult part was digging through the maze of the xml files that determine the tool parameters, configuration and requirements. I have two suggestions. First, the process of finding and adjusting tool configuration needs to be much more compact and uniform. Say ONE xml file of dependencies and ONE of tool command / parameters / configuration with appropriate sections / tags for each tool. Second, the tool shed needs regular vetting. The GATK experience was a nightmare. The tool shed says: GATK 3.0. When I installed it - not even close to GATK 3.0 (see my fun with it above). Alex.

ADD REPLYlink written 9 months ago by apozhitkov50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 84 users visited in the last hour