Problem With Gatk Tools Not Accepting Bam File As Input



I am using GATK tools on the useGalaxy main server to detect variants in a mutant C. elegans whole genome sequence obtained with an Illumina instrument (my own data). The first GATK tool I tried to use, Realigner Target Creator, gave me an error message. In the tool window, my input file (a BAM file previously run through Add or Replace Groups) did not generate an error, but the reference genome file (ce10) which I specified as found in History, produced the following reference list-specific error: "History does not include a dataset of the required format/build". I got the same error when I tried to use this input file to run the GATK Depth of Coverage tool.

I have searched Galaxy mail archives for this error, and have found other examples, but none involving these tools. The ce10 database was listed in the History attributes of the BAM file I used, and this database has worked with all of the Galaxy tools I used up to this point. Something about the ce10 format is unacceptable to GATK, or it is not even picking it up from the History. I don't know how to access ce10 to check its format. I have only found the inbuilt reference genome files in Galaxy in drop-down menus for each tool.

Searching the GATK site for solutions has not been helpful, because they suggest GATK-specific functions to fix the format such as Create Sequence Dictionary. I don't have access to these tools within the Galaxy main server. Can someone suggest a workaround or a direct solution?



modified 5.4 years ago by Jennifer Hillman Jackson ♦ 25k • written 5.4 years ago by Politz, Samuel M. • 70


Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

GATK requires that reference genomes be sorted in a specific way. For certain genomes, the chromosomes included in the build are also restricted. This often differs from how most genomes are released in "full" format (with random, haplotype, and/or unmapped data). Other tools sometimes require the full format, or the data have simply already been used that way, so changing the built-in genomes at this point would be a backwards-compatibility issue. This is where using a genome from the history (on the public Main server, but only for small genomes) or a cloud or local Galaxy fits in with GATK.

The sort/build information can be found on the GATK web site. You can format the data before uploading it to Galaxy, or, within Galaxy, convert FASTA to tabular and apply a combination of filters and sorts to subset and order the data (each genome is a bit different, so there is no single method).

But for ce10 this has already been done. You can import a GATK-friendly version of the genome from one of the CloudMap publication's histories (Shared Data -> Published Pages -> CloudMap), as it also uses ce10. Here is a history that you can import; dataset #5 is the ce10 reference genome:

https://main.g2.bx.psu.edu/u/gm2123/h/cloudmapot266proofofprinciple

The publication may also give you ideas about how to format inputs for these tools. The ce10 reference genome can also serve as a model for how to sort other genomes (sometimes it takes a few tries to get the ordering right).

If you are switching genomes, you may need to start over from mapping. Some help on how to determine whether that is needed is in our wiki here: http://wiki.galaxyproject.org/Support#Reference_genomes

Hopefully this helps,

Jen
Galaxy team

--
Jennifer Hillman-Jackson
Galaxy Support and Training
http://galaxyproject.org
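For anyone who prefers to prepare the reference before uploading it to Galaxy, the subset-and-reorder step mentioned in the answer could be sketched roughly as follows. This is only an illustration, not the method used for the CloudMap ce10 dataset: the function names (`parse_fasta`, `reorder_fasta`) and the `wanted_order` argument are hypothetical, and the correct contig order for any given genome has to come from the GATK documentation. It also holds all sequences in memory, so it suits small genomes only.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs; name is the first word after '>'."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def reorder_fasta(lines: Iterable[str], wanted_order: List[str], width: int = 60) -> str:
    """Keep only the contigs in wanted_order and emit them in that order.

    wanted_order is supplied by the caller (an assumption of this sketch);
    contigs not listed, such as random or unmapped ones, are dropped.
    """
    records = dict(parse_fasta(lines))
    missing = [n for n in wanted_order if n not in records]
    if missing:
        raise KeyError(f"contigs not found in FASTA: {missing}")
    out: List[str] = []
    for name in wanted_order:
        seq = records[name]
        out.append(f">{name}")
        # Re-wrap the sequence at a fixed line width.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

With a real file, this might be called as `reorder_fasta(open("genome.fa"), order)`, where `genome.fa` and `order` are placeholders, and the returned text written to a new FASTA file for upload. Remember that a BAM file mapped against a differently ordered reference may still need to be remapped, as noted above.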

written 5.4 years ago by Jennifer Hillman Jackson ♦ 25k
