Objective The objective is to implement a variant discovery and annotation pipeline similar to one published by Valle et al., (BMC Bioinformatics 2016, 17, 341) and compare the results with those produced by NextGene. The NextGene results were experimentally verified with known variants as well as a DNA dilution series to assess allele frequency quantification (Misyura et al., J of Mol. Diag., 2016, 18, 842). Also, test two versions of the human genome, i.e., hg19 from UCSC and the hg19 from 1000 Genomes. The required Galaxy tools are Fastqgroomer, BWA-MEM, Picard, GATK 3.0, Annovar.
“The Process”
- Starting with a fresh install of Galaxy (February 2017).
- Created a new database in Postgres and let Galaxy use it.
- Obtain human_g1k_v37.fasta.gz from NCBI and upload to Galaxy.
- Install data_manager_fetch_genome_dbkeys_all_fasta, which is needed to populate all_fasta data table within Galaxy, which is in turn needed for the other tools. Success
- Run data_manager_fetch_genome_dbkeys_all_fasta with human_g1k_v37.fasta.gz Success
- Run data_manager_fetch_genome_dbkeys_all_fasta and tell it to fetch hg19 from UCSC FTP. The ftp wants a “dbkey”, hg19 is provided (as a guess). Success
- Install fastqgroomer, bwa, data_manager_bwa_index_builder, data_manager_bwa_mem_index_builder. Success :-)
- Installing GATK 3.0. Fail : ‘gdata’ not available in ‘gplot’ in the dependency package_r_for_gatk_3_4_0. What?? Uninstall GATK, reinstall, fail
- Digging in: according to Galaxy, the faulty package is in /…/galaxy17/shed_tools/toolshed.g2.bx.psu.edu/repos/avowinkel/package_r_for_gatk_3_4_0/49c62e9b71ad/package_r_for_gatk_3_4_0 , there is a file tool_dependencies.xml
- Edit: vi tool_dependencies.xml; found lots of links to github to download various R packages.
- Searching Google about R, gplot and gdata. People noticed circular dependency of these two packages, no solution suggested.
- OK, when in doubt, cut it out. Removed the gplot from the list of package dependencies.
- Reinstall package_r_for_gatk_3_4_0, success.
- Now, need to build bwa-mem indexes for the bwa-mem to run. Running data_manager_bwa_mem_index_builder with hg19g100. Fail Something in “subprocess.py” –can’t open file
- Digging into the subprocess.py and the files calling it. ‘samtools” is hardcoded, not in PATH.
- Found where samtools is, found an env.sh file there.
- Added to the .bash_profile a call: source env.sh
- Restart Galaxy
- Running data_manager_bwa_mem_index_builder with hg19g100. Success
- Running data_manager_bwa_mem_index_builder with hg19UCSC. Fail “Cannot allocate 0 bytes”. O-oh
- Inspecting hg19UCSC fasta file. Completely corrupt, does not even look like fasta!
- Go to UCSC, attempt to find the hg19. Confusing stuff all around. Finally, there is a ling to FTP.
- Got to FTP, the hg19 is a tar.gz of individual chromosomes and unassembled regions. Where is a single file? Nowhere to be found.
- Download the “chromFa.tar.gz”. tar -xfz chromFa.tar.gz
- Concatenate all chromosomes into one file: find . -type f -name '*.fa' -exec cat {} + >> hg19ucsc.fa
- Go to /…/galaxy17/galaxy/tool-data/toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/86fa71e9b427/all_fasta.loc determine where the original ucsc h19 was, remove the file, remove the mention of it from the .loc file
- Rerun data_manager_fetch_genome_dbkeys_all_fasta and data_manager_bwa_mem_index_builder with the newly uploaded hg19ucsc.fa Success
- OK, let’s try fastq files from real patients. Run Fastqgroomer. Success
- Run bwa-mem. Fail - can’t see indexes.
- Hmmm. Let’s try data_manager_bwa_index_builder on both genomes. Success
- Run bwa-mem. Success Strange, I thought bwa-mem indexes were supposed to be used. Oh well…
- Run SortSam (Picard). Success
- Run MarkDuplicates (Picard). Success
- Run IndelRealignment (GATK). Fail - can’t see reference genome.
- Install and run data_manager_gatk_picard_index_builder. Success
- Run IndelRealignment (GATK). Fail - can’t see reference genome. WHY???
- Googling, a Biostar forum expert gives a nonsensical advice about the very same problem someone else already encountered.
- Dig into /…/galaxy17/shed_tools/toolshed.g2.bx.psu.edu/repos/avowinkel/gatk/b80ff7f43ad1/gatk/gatk.xml
- The following tags are discovered: <param name="ref_file" type="select" label="Using reference genome" help="-R,‑‑reference_sequence &lt;reference_sequence&gt;"> <options from_data_table="picard_indexes"/>
- Sooo, it is pickard_indexes we need, not the ones produced by data_manager_gatk_picard_index_builder
- Install data_manager_picard_index_builder for both hg19 genomes. Success
- Run data_manager_picard_index_builder Fail “Uncaught exception in exposed API”
- Galaxy restart, run data_manager_picard_index_builder again. Success
- Run IndelRealignment. Sees reference genomes. Fail Needs RealignTargetCreator to be run first
- Run RealignTargetCreator Can’t find GenomeAnalysisTK.jar
- Back to the admin screen of Galaxy, manage installed tools, GATK, lots of stuff, there is a second GATK button, click
- See the command: java -Xmx\${GATK_MEM:-\${SLURM_MEM_PER_NODE:-4096}}M -jar "\$GATK_PATH/GenomeAnalysisTK.jar" \${THREAD_STRING:-}
- GATK_PATH is shown as a button in Galaxy admin screen, which points to ./database/dependencies/environment_settings/GATK_PATH/avowinkel/gatk/b80ff7f43ad1/env.sh
- Within the env.sh, there is a totally wrong path to nowhere GATK_PATH=/mnt/…; export GATK_PATH
- Search for GenomeAnalysisTK.jar. Found in ./galaxy17/galaxy/database/dependencies/gatk/1.4/devteam/package_gatk_1_4/ec95ec570854/jars/GenomeAnalysisTK.jar
- Hmmm, sounds like a previous version. Don’t remember installing that, but OK. Fixed ./database/dependencies/environment_settings/GATK_PATH/avowinkel/gatk/b80ff7f43ad1/env.sh
- Run RealignTargetCreator. Fail Crash-boom! Exceptions everywhere…
- Go to Broad institute and download GenomeAnalysisTK.jar
- Place it into /home/…/gatk_broad
- Adjust ./database/dependencies/environment_settings/GATK_PATH/avowinkel/gatk/b80ff7f43ad1/env.sh
- Run RealignTargetCreator. Fail Can’t find some special index file .fai. That .fai exists in the directory referred to by the data_manager_gatk_picard_index_builder !
- OK, let’s tell gatk to look into the gatk_picard_indexes table
- Edit: vi /…/galaxy17/shed_tools/toolshed.g2.bx.psu.edu/repos/avowinkel/gatk/b80ff7f43ad1/gatk/gatk.xml
- Run RealignTargetCreator. Fail Found the .fai file, now .dict file isn’t found. Well, there is a .dict, but with a wrong name. Renamed to the expected name.
- Run RealignTargetCreator. Fail “SAM/BAM/CRAM file input.bam is malformed. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1317for more information. Error details: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups” Lovely… At least something is running.
- Run AddOrReplaceReadGroups (this could have been done earlier, but ok for the test now)
- Run RealignTargetCreator. Fail “The contig order in reads and reference is not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files.. reads contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9…]; reference contigs = [MT, 1, 2, 3, 4, 5, 6, 7, 8, 9…]”
- Need to run ReorderSam (Picard), which takes picard_index as reference. Not the gatk_picard_index, which makes it inconsistent. In fact /…/galaxy17/shed_tools/toolshed.g2.bx.psu.edu/repos/devteam/picard/6741a8ace658/picard/ picard_ReorderSam.xml shows that the table for indexes is picard_indexes.
- Run RealignTargetCreator. Same Fail
- Replaced picard_index with gatk_picard_index.
- Rerun ReorderSam. Success
- Run RealignTargetCreator. Success
- Run IndelRealigner (GATK). Success
- Run HaplotypeCaller (GATK). Still running