Exome Sequence Analysis Question - using GATK Unified Genotyper


4.2 years ago by

United Kingdom

hugo_gale • 0 wrote:

Hi everyone,

I'm trying to call SNVs and short indels using the GATK Unified Genotyper. I have previously selected BOTH under 'Genotype likelihoods calculation model to employ', but when I select this for my new sample it does not work. I've tried running the SNP and INDEL options separately: only the SNP run returns a VCF file for annotation, and the INDEL run fails.

I was wondering if anyone else has had this problem? I'm very new to bioinformatics and I am trying to work out if there is something wrong with the data I am using or if the tool isn't working at the moment.

Any help would be greatly appreciated.

Claire

gatk unified genotyper indel galaxy • 1.7k views


modified 4.2 years ago by Jennifer Hillman Jackson ♦ 25k • written 4.2 years ago by hugo_gale • 0

4.2 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hi Claire,

Thanks for sending in the bug reports; it was helpful to examine the actual data for this one.

The dbSNP ROD file used in the analysis is lacking the full annotation required by the pipeline. In particular, lines that define the genome content are missing (triggering the empty array error you encountered). Lines such as:

##contig=<ID=1,length=249250621,assembly=b37>

##contig=<ID=10,length=135534747,assembly=b37>

##contig=<ID=11,length=135006516,assembly=b37>
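A quick way to check whether a VCF has these header lines before using it as a dbSNP ROD file is to scan its header yourself. The following is only a minimal sketch in Python, not part of GATK or Galaxy: the function names (`parse_contig_line`, `contigs_in_header`, `open_vcf`) are made up for illustration, and the parser assumes simple comma-separated key=value fields with no quoted commas, as in the lines above.

```python
import gzip
import re

# Matches header lines such as ##contig=<ID=1,length=249250621,assembly=b37>
CONTIG_RE = re.compile(r"^##contig=<(.+)>\s*$")


def parse_contig_line(line):
    """Return the key/value fields of a ##contig header line, or None.

    Assumption: fields are plain key=value pairs separated by commas,
    with no quoted values that themselves contain commas.
    """
    match = CONTIG_RE.match(line)
    if match is None:
        return None
    fields = {}
    for item in match.group(1).split(","):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def contigs_in_header(lines):
    """Collect all ##contig entries from the meta-information header."""
    contigs = []
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM column line
        parsed = parse_contig_line(line)
        if parsed is not None:
            contigs.append(parsed)
    return contigs


def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")
```

With a file on disk, something like `with open_vcf("dbsnp.vcf") as handle: print(len(contigs_in_header(handle)))` (the file name is a placeholder) would report how many contig lines the header defines. A result of 0 would point to the kind of incomplete header described above.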

A better choice for a dbSNP ROD dataset is the one provided with the GATK bundle. You can obtain this directly from the Broad or, when working on the public Main Galaxy instance, use the copy in the 'Shared Data -> Data Library -> GATK' bundle datasets.

As a secondary issue, I believe it is important to use the dbSNP VCF file directly in any analysis (instead of previously processed VCF datasets, as I saw in one of your runs). If you would like to merge VCF files later, there is a tool for that: NGS: VCF Manipulation -> VCFcombine

Hopefully this helps resolve your issue, and helps guide others who are learning the GATK pipeline to the resources available on Main (http://usegalaxy.org). If our team has more to add specific to your situation, we will comment again and/or reply via email.

Take care, Jen, Galaxy team

written 4.2 years ago by Jennifer Hillman Jackson ♦ 25k

Update: A full test run using the GATK bundle dbSNP dataset and your data resulted in the same failure again. It is not clear whether this means the job is too large to run on the public Main Galaxy instance, but that is a good possibility. I updated our Trello card for this tool to reflect the issue, and our development team will be reviewing it.
http://trello.com/c/plQYSCvS

Meanwhile, you can try the job out on a cloud Galaxy with more resources. This may be a good idea anyway if you have a large amount of data to process. There are also updated wrappers in the Tool Shed appropriate for cloud use, if you want to explore those. Here is how to get started:
http://usegalaxy.org/cloud
http://usegalaxy.org/toolshed

Best, Jen, Galaxy team

written 4.2 years ago by Jennifer Hillman Jackson ♦ 25k
