Genome ID mismatch problem on clean, straight-from-NCBI data


cgibas • 20 (United States) wrote, 4.4 years ago:

I have a local Galaxy instance running and I am attempting to use a chloroplast genome as a custom genome for a class demonstration. I am using a FASTA nucleotide file and its corresponding GFF file, directly downloaded from NCBI and unmodified. The error message I get when trying to add the GFF data to my visualization in Trackster is:

Input error: Chromosome NC_007898.3 found in your input file but not in your genome file.

However, when I examine the files they seem to be completely standard, and the identifier NC_007898.3 is used throughout both. I have viewed the help regarding Genome ID mismatches (found in another thread on BioStars) and checked the obvious. Any other suggestions?

File heads look like this:

>gi|544163592|ref|NC_007898.3| Solanum lycopersicum chloroplast, complete genome

TGGGCGAACGACGGGAATTGAACCCGCGCATGGTGGATTCACAATCCACTGCCTTGATCCACTTGGCTAC

and

##gff-version 3

#!gff-spec-version 1.20

#!processor NCBI annotwriter

##sequence-region NC_007898.3 1 155461

##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4081

NC_007898.3 RefSeq region 1 155461 . + . ID=id0;Dbxref=taxon:4081;Is_circular=true;Name=Pltd;authority=Lycopersicon esculentum (L.);common=tomato;cultivar=LA3023;gb-synonym=Lycopersicon esculentum;gbkey=Src;genome=chloroplast;mol_type=genomic DNA;old-name=Lycopersicon esculentum;specimen-voucher=Clemson University Genomics Institute

NC_007898.3 RefSeq gene 71636 71749 . - . ID=gene0

etc.

galaxy • 1.5k views


modified 4.4 years ago by Jennifer Hillman Jackson ♦ 25k • written 4.4 years ago by cgibas • 20

Jennifer Hillman Jackson ♦ 25k (United States) wrote, 4.4 years ago:

Hello,

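The custom reference genome (.fasta) needs identifiers that exactly match those in the reference annotation (GFF3). The identifier is the text between the ">" and the first whitespace on the title line, so your current file's identifier is "gi|544163592|ref|NC_007898.3|", not "NC_007898.3". In your case, this means the .fasta dataset should be modified so that its identifier is: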

NC_007898.3
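A quick way to see the mismatch before loading anything into Trackster is to compare the two sets of identifiers directly. The following is only a sketch in plain Python, not anything Galaxy itself runs; the function names `fasta_ids` and `gff_seqids` are made up for illustration, and it assumes the identifier is the first whitespace-delimited token on each ">" line.

```python
def fasta_ids(lines):
    """Return each record's identifier: the text after '>' up to the first whitespace."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def gff_seqids(lines):
    """Return the set of column-1 values from GFF feature lines (comments skipped)."""
    seqids = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # The first whitespace-delimited field is the sequence identifier.
        seqids.add(line.split(None, 1)[0])
    return seqids


# The file heads from the question, shortened.
example_fasta = [
    ">gi|544163592|ref|NC_007898.3| Solanum lycopersicum chloroplast, complete genome",
    "TGGGCGAACGACGGGAATTGAACCCGCGCATGGTGGATTCACAATCCACTGCCTTGATCCACTTGGCTAC",
]
example_gff = [
    "##gff-version 3",
    "##sequence-region NC_007898.3 1 155461",
    "NC_007898.3\tRefSeq\tgene\t71636\t71749\t.\t-\t.\tID=gene0",
]

# Identifiers used by the annotation but absent from the genome file.
missing = gff_seqids(example_gff) - set(fasta_ids(example_fasta))
print(sorted(missing))
```

Any identifier that ends up in `missing` would be expected to trigger the "found in your input file but not in your genome file" error.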

For genomes with multiple chromosomes, transforming the .fasta identifier/description lines is a bit more involved: convert to tabular, convert the delimiters to tabs, convert back to fasta while choosing the correct columns for the identifier and the sequence content, and wrap the sequence lines to a standard length (60 is good).
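Outside of Galaxy, the same multi-chromosome cleanup could be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Galaxy tool chain described above: `ncbi_accession` and `rewrite_fasta` are hypothetical helper names, and the code assumes old-style NCBI title lines where the accession follows a "ref" tag (other source databases use other tags, in which case the first token is kept unchanged).

```python
def ncbi_accession(header):
    """Extract the accession from a title line such as '>gi|...|ref|NC_007898.3| ...'.

    Assumption: the accession is the field right after 'ref'. If there is no
    such field, the whole first token is returned as-is.
    """
    tokens = header[1:].split()
    if not tokens:
        return ""
    parts = tokens[0].split("|")
    if "ref" in parts and parts.index("ref") + 1 < len(parts):
        return parts[parts.index("ref") + 1]
    return tokens[0]


def rewrite_fasta(lines, width=60):
    """Shorten every title line to its accession and rewrap sequence to `width`."""
    out = []
    seq = []

    def flush():
        # Emit the sequence collected so far in fixed-width lines.
        joined = "".join(seq)
        for i in range(0, len(joined), width):
            out.append(joined[i:i + width])
        seq.clear()

    for line in lines:
        line = line.strip()
        if not line:
            continue  # drop blank lines
        if line.startswith(">"):
            flush()
            out.append(">" + ncbi_accession(line))
        else:
            seq.append(line)
    flush()
    return out


# Two records; the second identifier is invented purely for the example.
example = [
    ">gi|544163592|ref|NC_007898.3| Solanum lycopersicum chloroplast, complete genome",
    "ACGT" * 20,
    ">gi|2|ref|NC_EXAMPLE.1| hypothetical second record",
    "AC",
]
for row in rewrite_fasta(example):
    print(row)
```

Wrapping in `flush` happens per record, so a sequence that arrives as one long line or as unevenly wrapped lines comes out with uniform line lengths.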

But for genomes with a single chromosome, this can go more quickly. Remove the first line of the .fasta file (the tool is in the "Text Manipulation" group). Then use "Get Data -> Upload File" to paste in a single-line file that contains just the new identifier line, and load it as a dataset:

>NC_007898.3

No extra spaces, no extra lines.

Then use "Concatenate" to place the new identifier line at the top of the sequence-only dataset that had the prior identifier line removed. Assign as .fasta datatype and all should be good to go.
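For anyone who prefers to fix the file before uploading it, the single-chromosome steps (drop the old first line, put the new identifier line on top) might look like this in Python. It is a minimal sketch with a hypothetical function name, `replace_first_header`, and it assumes the file holds exactly one sequence whose title is the first line.

```python
def replace_first_header(lines, new_id):
    """Swap the first title line for '>' + new_id and drop any blank lines."""
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA title line on the first line")
    # Keep every non-blank line after the old title unchanged.
    body = [ln.rstrip("\n") for ln in lines[1:] if ln.strip()]
    return [">" + new_id] + body


original = [
    ">gi|544163592|ref|NC_007898.3| Solanum lycopersicum chloroplast, complete genome",
    "",
    "TGGGCGAACGACGGGAATTGAACCCGCGCATGGTGGATTCACAATCCACTGCCTTGATCCACTTGGCTAC",
]
fixed = replace_first_header(original, "NC_007898.3")
print("\n".join(fixed))
```

The result has no extra spaces and no extra lines, matching the requirement stated above.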

You may need to convert datatypes as you process - "tabular" format works well with the manipulation tools. Just make certain you finish with the .fasta datatype, even if you need to assign it yourself (use the pencil icon for a dataset to reach the "Edit Attributes -> Datatype" metadata modifier). More is in our wiki on the "Support" page under various categories of troubleshooting, should you run into problems along the way. However, this is usually a straightforward replacement.

Hopefully this helps! Jen, Galaxy team

modified 4.4 years ago • written 4.4 years ago by Jennifer Hillman Jackson ♦ 25k

Thanks! It worked. Sorry for the n00b question :)

written 4.4 years ago by cgibas • 20
