Question: contig order b37
jane19 (United Kingdom) wrote, 2.7 years ago:
Hi,

I'm trying to use GATK on our own Galaxy instance but get the error below. I have installed the b37 (hg_g1k_v37) genome, indexed it, and run the alignment with BWA-MEM. When I try to run the GATK Unified Genotyper, I hit a problem.

It looks like the MT chromosome is in the wrong position in the reference. Why are my reads not in the same contig order as the genome I used for alignment?

Is there another version of b37 for GATK in Galaxy? I used the data_manager tool to download this one.


# ERROR MESSAGE: Input files reads and reference have incompatible contigs: Order of contigs differences, which is unsafe.
##### ERROR   reads contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]
##### ERROR   reference contigs = [MT, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1]
##### ERROR ------------------------------------------------------------------------------------------
Tags: bwa, alignment, gatk
christophe.habib (France) wrote, 2.7 years ago:

Hello,

I had the very same issue recently. If you load an already correctly sorted genome from a history with a Data Manager, check that the "Choose the source for the reference genome" option is set to "As is".

Cheers,

Christ

Jennifer Hillman Jackson (United States) wrote, 2.7 years ago:

Hello,

The first ordering of the contigs in the error report (the reads) is the correct one. The second (the reference) has the mitochondrial chromosome in the wrong position. I am not sure where you sourced the genome, but if you can get a correctly ordered one into your history (one can be obtained from the Galaxy Rsync server), it can be loaded into your local instance using a Data Manager with the option to use a reference genome "from a history".
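For readers who want to see exactly where two contig lists diverge, here is a minimal Python sketch. It is not part of GATK or Galaxy; the function name `first_order_mismatch` is hypothetical, and the two lists are abbreviated versions of the ones in the error report above.

```python
def first_order_mismatch(reads_contigs, reference_contigs):
    """Return (index, reads_name, reference_name) for the first position
    where the two contig lists disagree, or None if they are identical."""
    for i, (a, b) in enumerate(zip(reads_contigs, reference_contigs)):
        if a != b:
            return i, a, b
    # Same shared prefix but different lengths: report the first extra entry.
    if len(reads_contigs) != len(reference_contigs):
        i = min(len(reads_contigs), len(reference_contigs))
        a = reads_contigs[i] if i < len(reads_contigs) else None
        b = reference_contigs[i] if i < len(reference_contigs) else None
        return i, a, b
    return None


# Abbreviated lists modelled on the error report (only one GL contig kept).
autosomes = [str(n) for n in range(1, 23)]
reads = autosomes + ["X", "Y", "MT", "GL000207.1"]
reference = ["MT"] + autosomes + ["X", "Y", "GL000207.1"]

print(first_order_mismatch(reads, reference))  # (0, '1', 'MT')
print(set(reads) == set(reference))            # True: same contigs, different order
```

In this case the two lists contain the same contigs, so only the position of MT differs.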

Hopefully that alternative works for you,

Jen, Galaxy team

 

jane19 (United Kingdom) wrote, 2.7 years ago:

FYI.

The rsync version of b37 was the same as mine (from 1000g), but importing it as 'gatk' rather than 'as is' seemed to fix the problem.

Thanks,

Jane


Hi Jane, if you have time, could you share the link to the 1000 Genomes source file from which you obtained the fasta version of the genome? Thanks, Jen

Reply by Jennifer Hillman Jackson, 2.7 years ago

Update: The genome almost certainly did not come from the GATK resource bundle, and the source that was used is known to have non-GATK ordering. Please see the other post I just added here about the new Tutorial. I tried to capture the discussion here, plus other common questions that come up around these data (and genomes in general).

I have deliberately not gone over Data Managers in detail. The Galaxy wiki already has many details on their use, and enhanced tutorials covering usage are already planned. The primary problems people encounter have to do with the starting fasta genome, and that is what I wanted to cover: not just for GATK, but for most similar usage when preparing data for a Custom genome/build or as input on a local/cloud instance for additional tool indexing. I did use GATK in the example, since it has an additional formatting component: chromosome ordering.

Creating correct chromosome ordering for GATK when the source is not available from GATK is another topic that probably should be covered in a Tutorial. There are prior Biostar posts here where I have given those details, but a few workflows and pointers would certainly help more. Our team will work on consolidating that information into another Tutorial topic.

THANK YOU to all on this thread for sharing comments that help other users like yourselves navigate one of the trickier parts of Galaxy. Flexibility in data entry leads to some necessary complexities. Genomes can vary in format and content, so this flexibility is there by design to limit inherent restrictions on content. But hopefully a few more bundled, topic-specific help tutorials will guide users through this process more effectively.

Once those tutorials are up, bring on the feedback and suggestions! Everything can be updated, and not all help needs can be predicted :) We want to help you learn to use Galaxy above all else.

Thanks again! Jen

Reply by Jennifer Hillman Jackson, 2.7 years ago

Hi,

The first genome we tried was from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz

I haven't opened it to check the ordering, so I'm not sure where it got messed up.
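One way to check the ordering without opening the file by hand is to list the sequence names in the order they appear. The following is a minimal Python sketch, assuming a standard FASTA file (optionally gzip-compressed) in which each record starts with a `>` header line; the function name `fasta_contig_order` is hypothetical.

```python
import gzip


def fasta_contig_order(path):
    """Return the sequence names of a FASTA file in file order.

    Files whose name ends in '.gz' are read through gzip; everything
    else is read as plain text.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    names = []
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                # The name is the first whitespace-delimited token after '>'.
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names
```

Calling `fasta_contig_order` on a local copy of the genome returns a list that can be compared directly with the "reference contigs" line of the error report. If a `.fai` index already exists for the file, its first column gives the same names in the same order and is much quicker to read.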

BW,

Jane

Reply by jane19, 2.7 years ago

Thanks for sharing the link. It confirms that the source you used is the known version from 1000 Genomes with a non-GATK sort order. See the tutorial I included, specifically the link to the Broad Institute's GATK input requirements. The differences between the file above and what is included in the GATK resource bundle are explained there in detail.

The other user had sorted the genome properly themselves before loading it with a Data Manager. They made the modifications, loaded the genome fasta file into a history (of an Admin account), then executed the Data Manager. This is why the option "as-is" was appropriate for that case.
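As a rough illustration of what sorting a genome by hand can involve, here is a minimal Python sketch. It is not the other user's actual method and not a Galaxy or GATK tool. It assumes b37-style names without a "chr" prefix, unique sequence names, and the target order shown first in the error report: 1 to 22, X, Y, MT, then all remaining contigs in their original order. All function names are hypothetical.

```python
def gatk_style_order(names):
    """Order contigs as 1..22, X, Y, MT, then everything else as found.

    Assumption: b37-style names without a 'chr' prefix.
    """
    primary = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]
    present = set(names)
    primary_set = set(primary)
    ordered = [n for n in primary if n in present]
    ordered += [n for n in names if n not in primary_set]
    return ordered


def read_fasta(path):
    """Read an uncompressed FASTA into (names in file order, name -> lines)."""
    order, records, name = [], {}, None
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                name = fields[0] if fields else ""
                order.append(name)
                records[name] = [line]   # keep the header line unchanged
            elif name is not None:
                records[name].append(line)
    return order, records


def reorder_fasta(src, dst):
    """Write the records of src to dst in the order from gatk_style_order."""
    order, records = read_fasta(src)
    with open(dst, "w") as out:
        for name in gatk_style_order(order):
            out.writelines(records[name])
```

This sketch holds the whole genome in memory, so a human-sized reference needs several gigabytes of RAM. Any index built from the original file would need to be regenerated for the reordered one.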

Reply by Jennifer Hillman Jackson, 2.7 years ago
Jennifer Hillman Jackson (United States) wrote, 2.7 years ago:

This all comes up often enough that a distinct Tutorial gathering together the Galaxy wiki and external resources on the topic seemed worthwhile. This is the first draft. If you would like to see more covered, suggestions posted as replies to the post are most welcome.

Please see: Fasta Format, Custom Genomes, and GATK Chromosome ordering
