Question: Exome Sequence Analysis Question - using GATK Unified Genotyper
hugo_gale (United Kingdom) wrote, 4.2 years ago:

Hi everyone,

I'm trying to call SNVs and short indels using the GATK Unified Genotyper. I have previously selected BOTH under 'Genotype likelihoods calculation model to employ', but when I select this for my new sample it does not work. I've tried running the SNP and INDEL options separately: only the SNP run returns a VCF file for annotation; the INDEL run doesn't work.

I was wondering if anyone else has had this problem? I'm very new to bioinformatics and I am trying to work out if there is something wrong with the data I am using or if the tool isn't working at the moment.

Any help would be greatly appreciated.

Claire

Jennifer Hillman Jackson (United States) wrote, 4.2 years ago:

Hi Claire,

Thanks for sending in the bug reports; it was helpful to examine the actual data for this one.

The dbSNP ROD file used in the analysis is lacking the full annotation required by the pipeline. In particular, lines that define the genome content are missing (triggering the empty array error you encountered). Lines such as:

##contig=<ID=1,length=249250621,assembly=b37>
##contig=<ID=10,length=135534747,assembly=b37>
##contig=<ID=11,length=135006516,assembly=b37>
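One way to check a dbSNP VCF for these header lines before running the tool is sketched below. This is only an illustration, not part of Galaxy or GATK: the function name `contig_ids` is hypothetical, and the parser assumes simple `key=value` fields without quoted commas inside the `##contig` line.

```python
import re

# Matches header lines such as: ##contig=<ID=1,length=249250621,assembly=b37>
CONTIG_RE = re.compile(r"^##contig=<(.*)>\s*$")


def contig_ids(lines):
    """Return the contig IDs declared in ##contig header lines, in file order.

    `lines` is any iterable of text lines (for example an open VCF file).
    Hypothetical helper for illustration; assumes no quoted commas in fields.
    """
    ids = []
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends at the first record line
        match = CONTIG_RE.match(line)
        if match:
            for field in match.group(1).split(","):
                key, _, value = field.partition("=")
                if key == "ID":
                    ids.append(value)
    return ids
```

To use it on a file, open the VCF as text (`open(path)`, or `gzip.open(path, "rt")` for a compressed file) and pass the file object to `contig_ids`. An empty result means the header has no `##contig` lines, which is the situation described above.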
 

A better choice for a dbSNP ROD dataset is the one provided with the GATK bundle. You can obtain it directly from the Broad or, when working on the public Main Galaxy instance, use the copy in the 'Shared Data -> Data Library -> GATK' bundle datasets.

As a secondary issue, I believe it is important to use the dbSNP VCF file directly in any analysis (instead of previously processed VCF datasets, as I saw in one of your runs). If you would like to merge VCF files later, there is a tool for that: NGS: VCF Manipulation -> VCFcombine

Hopefully this helps resolve your issue, and helps guide others who are learning the GATK pipeline to the resources available on Main (http://usegalaxy.org). If our team has more to add specific to your situation, we will comment again and/or reply via email.

Take care, Jen, Galaxy team

 


Update: A full test run using the GATK bundle dbSNP dataset and your data resulted in the same failure again. It is not clear whether this means the job is too large to run on the public Main Galaxy instance, but that is a good possibility. I updated our Trello card for this tool to reflect the issue, and our development team will be reviewing it.
http://trello.com/c/plQYSCvS

Meanwhile, you can try the job out on a cloud Galaxy with more resources. This may be a good idea anyway if you have a large amount of data to process. There are also updated wrappers in the Tool Shed appropriate for cloud use, if you want to explore those. Here is how to get started:
http://usegalaxy.org/cloud
http://usegalaxy.org/toolshed

Best, Jen, Galaxy team
