I recently joined a project that aims to find novel mutations in certain diseases by comparing genomic data from diseased tissue and blood. After sending our samples off for sequencing, we received BAM files that are already aligned to hg19. I am trying to follow the protocol below, taken from a previous study that worked with the same kind of data:
Prevariant processing:
- Paired end reads were gap aligned to hg19 using BWA (Burrows–Wheeler Aligner).
- Poorly aligned/mapped reads were filtered away with Samtools; SAM to BAM conversion was done.
- PCR duplicates were marked and removed with the Picard package.
- Indel realignment with known sites and base quality score recalibration were performed with GATK (Genome Analysis Toolkit), in line with current best practices in the next-generation sequencing field for variant detection, to produce variant-caller ready reads in BAM format.
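For context, here is a rough sketch of what I understand step 4 to look like outside Galaxy. It assumes the GATK 3.x command-line syntax (the IndelRealigner walker is not part of GATK 4), and every file name below is a hypothetical placeholder rather than something from the study. The script only builds and prints the commands; it does not run them.

```python
# Sketch only: builds GATK 3.x-style command lines for indel realignment
# and base quality score recalibration. Nothing is executed.
# All file names are hypothetical placeholders.

def gatk_cmd(walker, ref, bam, extra):
    """Return one GATK 3.x command as a list of arguments."""
    return ["java", "-jar", "GenomeAnalysisTK.jar",
            "-T", walker, "-R", ref, "-I", bam] + extra


def realign_and_recalibrate_cmds(ref, bam, known_indels, known_snps):
    """Return the four commands for realignment followed by BQSR."""
    return [
        # 1. Find intervals that may need realignment around indels.
        gatk_cmd("RealignerTargetCreator", ref, bam,
                 ["-known", known_indels, "-o", "targets.intervals"]),
        # 2. Realign reads within those intervals.
        gatk_cmd("IndelRealigner", ref, bam,
                 ["-known", known_indels,
                  "-targetIntervals", "targets.intervals",
                  "-o", "realigned.bam"]),
        # 3. Build the recalibration table from known variant sites.
        gatk_cmd("BaseRecalibrator", ref, "realigned.bam",
                 ["-knownSites", known_snps, "-knownSites", known_indels,
                  "-o", "recal.table"]),
        # 4. Write a new BAM with recalibrated base qualities.
        gatk_cmd("PrintReads", ref, "realigned.bam",
                 ["-BQSR", "recal.table", "-o", "recalibrated.bam"]),
    ]


if __name__ == "__main__":
    cmds = realign_and_recalibrate_cmds(
        "hg19.fasta", "tumor.bam", "known_indels.vcf", "dbsnp.vcf")
    for cmd in cmds:
        print(" ".join(cmd))
```

As far as I know, command-line GATK also expects a `.fai` index and a `.dict` sequence dictionary next to the reference FASTA (normally made with `samtools faidx` and Picard's CreateSequenceDictionary). I have not confirmed whether the Galaxy wrapper creates these for an uploaded FASTA.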
Variant calling: identification of somatic substitutions and short indels
- Somatic single-nucleotide variants were called with MuTect (beta) (https://confluence.broadinstitute.org/display/CGATools/MuTect).
- Single-nucleotide variants reported in dbSNP129 (the last accepted “pure” version of dbSNP) were removed, unless they were also present in COSMICv56 (Forbes et al., 2008).
- Somatic short indels were called with the Somatic Indel Detector walker that is part of the GATK package (https://confluence.broadinstitute.org/display/CGATools/Indelocator). Both these programs take in paired tumor-normal BAMs as input.
- The single-nucleotide variants and indels were annotated with Oncotator (http://www.broadinstitute.org/oncotator/), a rapid and accurate web-based annotation tool.
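To make sure I am reading the dbSNP/COSMIC filtering rule correctly, here is a minimal sketch of the logic as I understand it: a called variant is dropped if it is in dbSNP, unless it is also in COSMIC. The function name, the tuple representation of a variant, and the example coordinates are all made up for illustration; the study does not say how the filter was implemented.

```python
# Sketch of the filtering rule: remove variants reported in dbSNP
# unless they are also present in COSMIC.
# A variant is represented here as (chrom, pos, ref, alt); this
# representation and the example values below are hypothetical.

def filter_known_snps(variants, dbsnp, cosmic):
    """Keep variants that are not in dbSNP, or that are rescued by COSMIC."""
    return [v for v in variants if v not in dbsnp or v in cosmic]


if __name__ == "__main__":
    calls = [
        ("chr1", 100, "A", "T"),   # in neither database: kept
        ("chr2", 200, "G", "C"),   # in dbSNP only: removed
        ("chr3", 300, "C", "A"),   # in dbSNP and COSMIC: kept
    ]
    dbsnp = {("chr2", 200, "G", "C"), ("chr3", 300, "C", "A")}
    cosmic = {("chr3", 300, "C", "A")}

    for variant in filter_known_snps(calls, dbsnp, cosmic):
        print(variant)
```

Using sets for the two databases keeps each membership check cheap, which would matter with millions of dbSNP entries.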
I'm unable to complete step 4 of the prevariant processing, where I perform indel realignment using GATK on Galaxy. I can't get the tool to use a reference genome, even though I've tried uploading Hg19.fasta. Can anyone help me figure out what's going wrong? Thanks!