Hi, I'm trying to use gatk on our own instance but get the error below. I installed the b37 (hg_g1k_v37) genome, indexed it, and ran the alignment with bwa-mem. When I try to run gatk unified genotyper I hit a problem. It looks like the MT chromosome is in the wrong order in the reference. I don't understand why my reads are not in the same order as the genome I used for alignment. Is there another version of b37 for gatk in galaxy? I used the data_manager tool to download this one.

# ERROR MESSAGE: Input files reads and reference have incompatible contigs: Order of contigs differences, which is unsafe.

##### ERROR reads contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]

##### ERROR reference contigs = [MT, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]

##### ERROR ------------------------------------------------------------------------------------------
Hello,
I had the very same issue recently. If you load an already properly sorted genome from a history with a Data Manager, make sure that the "Choose the source for the reference genome" option is set to "As is".
Cheers,
Christ
Hello,
The first ordering of the contigs in the error report is the correct one. The second has the mito chromosome incorrectly ordered. I am not sure where you sourced the genome, but if you can get a correctly ordered one into your history (one can be obtained from the Galaxy Rsync server), it can be loaded into your local instance using a Data Manager with the option to use a reference genome "from a history".
Hopefully that alternative works for you, Jen, Galaxy team
FYI.
The rsync version of b37 was the same as mine (from 1000g), but importing it as 'gatk' rather than 'as is' seemed to fix the problem.
Thanks,
Jane
Hi Jane, If you have time, could you share the link to the source file from 1000 Genomes from which you obtained the fasta version of the genome? Thanks, Jen
Update: The GATK resource bundle was almost certainly not the genome source; the source that was used is known to have non-GATK ordering. Please see the other post I just added here about the new Tutorial. I tried to capture the discussion here, plus other common questions that come up around these data (and genomes in general).
I have not gone over Data Managers in detail (on purpose). The Galaxy wiki already has many details on their use, and enhanced tutorials covering usage are already planned. The primary problems people encounter have to do with the starting fasta genome, and that is what I wanted to cover. Not just for GATK, but for most similar usage when preparing data for a Custom genome/build or as input on a local/cloud instance for additional tool indexing. But I did use GATK in the example, since it has an additional formatting component: chromosome ordering.
Creating correct chromosome ordering for GATK when the source is not available from GATK is another topic that probably should be covered in a Tutorial. There are prior Biostar posts here where I have given those details, but a few workflows and pointers would certainly help more. Our team will work on consolidating that information into another Tutorial topic.
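Until that Tutorial exists, here is a minimal sketch in Python of how a fasta file could be rewritten in a chosen contig order. It is not a GATK or Galaxy tool: the function name `reorder_fasta` and the file names are hypothetical, the input is assumed to be an uncompressed fasta, and the target order has to come from a trusted source (for example the "reads contigs" list in the error report above). Any existing indexes built from the old file (.fai, .dict, aligner indexes) would need to be regenerated afterwards.

```python
def reorder_fasta(in_path, out_path, target_order):
    """Copy the records of an uncompressed fasta file in the order given by
    target_order. Contigs not named in target_order are appended afterwards
    in their original order. (Hypothetical helper, for illustration only.)"""
    # First pass: note the byte offset at which every record starts.
    starts = []  # list of (contig name, start offset) in file order
    with open(in_path, "rb") as src:
        pos = 0
        for line in src:
            if line.startswith(b">"):
                fields = line[1:].split()
                # The contig name is the first token after '>'.
                starts.append((fields[0].decode() if fields else "", pos))
            pos += len(line)
        total = pos

    # Each record ends where the next one starts (or at end of file).
    spans = {}
    for i, (name, start) in enumerate(starts):
        end = starts[i + 1][1] if i + 1 < len(starts) else total
        spans[name] = (start, end)

    missing = [name for name in target_order if name not in spans]
    if missing:
        raise ValueError("contigs not found in fasta: %s" % ", ".join(missing))

    wanted = set(target_order)
    ordered = list(target_order) + [n for n, _ in starts if n not in wanted]

    # Second pass: copy each record's bytes to the output in the new order.
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for name in ordered:
            start, end = spans[name]
            src.seek(start)
            remaining = end - start
            last = b"\n"
            while remaining > 0:
                chunk = src.read(min(remaining, 1 << 20))
                if not chunk:
                    break
                dst.write(chunk)
                last = chunk[-1:]
                remaining -= len(chunk)
            # Guard against a final record that lacks a trailing newline.
            if last != b"\n":
                dst.write(b"\n")
```

For the case in this thread, the call might look like `reorder_fasta("human_g1k_v37.fasta", "human_g1k_v37.reordered.fasta", [str(i) for i in range(1, 23)] + ["X", "Y", "MT"])`, which moves MT after Y and leaves the GL contigs in their existing order. Copying byte ranges avoids holding a whole human genome in memory.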
THANK YOU to all on this thread for sharing comments to help other users like yourselves navigate one of the trickier parts of Galaxy. Flexibility in data entry leads to some necessary complexities. Genomes can vary in format and content, so this flexibility is there by design to limit inherent restrictions on content. But hopefully a few more bundled, topic-specific help tutorials will guide users through this process more effectively.
Once those tutorials are up, bring on the feedback and suggestions! Everything can be updated, and not all help needs can be predicted :) We want to help you to learn to use Galaxy above all else.
Thanks again! Jen
Hi,
The first genome we tried was from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
I haven't opened it to check the ordering, so I'm not sure where it got messed up.
BW,
Jane
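For anyone who wants to check the ordering without opening the whole file, a minimal sketch in Python that lists the contig names in the order they appear in a fasta file. The function name `contig_order` is hypothetical, and the sketch assumes a plain or gzip-compressed fasta whose header lines start with ">".

```python
import gzip


def contig_order(fasta_path):
    """Return the contig names of a fasta file in the order they appear."""
    # Pick gzip or plain text reading based on the file extension.
    opener = gzip.open if fasta_path.endswith(".gz") else open
    names = []
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    # The contig name is the first token after '>'.
                    names.append(fields[0])
    return names
```

Calling `contig_order("human_g1k_v37.fasta.gz")` and comparing the result with the "reads contigs" list from the error report shows where the two orderings diverge.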
Thanks for sharing the link. It confirms that the source you used is the known version from 1000 Genomes with a non-GATK sort order. See the tutorial I included - specifically the link to the Broad Inst.'s GATK input requirements. The differences between the file above and what is included in the GATK resource bundle are explained in detail.
The other user had sorted the genome properly themselves before loading it with a Data Manager. They made the modifications, loaded the genome fasta file into a history (of an Admin account), then executed the Data Manager. This is why the option "as-is" was appropriate for that case.
This all comes up often enough that a distinct Tutorial gathering together the Galaxy wiki and external resources on the topic seemed worth putting together. It is the first draft. If you would like to see more covered, suggestions as replies to the post are most welcome.
Please see: Fasta Format, Custom Genomes, and GATK Chromosome ordering