Question: Genome ID mismatch problem on clean, straight-from-NCBI data
4.4 years ago, cgibas (United States) wrote:

I have a local Galaxy instance running and I am attempting to use a chloroplast genome as a custom genome for a class demonstration. I am using a FASTA nucleotide file and its corresponding GFF file, directly downloaded from NCBI and unmodified. The error message I get when trying to add the GFF data to my visualization in Trackster is:

Input error: Chromosome NC_007898.3 found in your input file but not in your genome file.

However, when I examine the files they seem to be completely standard, and the identifier NC_007898.3 is used throughout both. I have viewed the help regarding Genome ID mismatches (found in another thread on BioStars) and checked the obvious. Any other suggestions? 

 

File heads look like this:

>gi|544163592|ref|NC_007898.3| Solanum lycopersicum chloroplast, complete genome
TGGGCGAACGACGGGAATTGAACCCGCGCATGGTGGATTCACAATCCACTGCCTTGATCCACTTGGCTAC

and

##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_007898.3 1 155461
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4081
NC_007898.3    RefSeq    region    1    155461    .    +    .    ID=id0;Dbxref=taxon:4081;Is_circular=true;Name=Pltd;authority=Lycopersicon esculentum (L.);common=tomato;cultivar=LA3023;gb-synonym=Lycopersicon esculentum;gbkey=Src;genome=chloroplast;mol_type=genomic DNA;old-name=Lycopersicon esculentum;specimen-voucher=Clemson University Genomics Institute
NC_007898.3    RefSeq    gene    71636    71749    .    -    .    ID=gene0

etc.

4.4 years ago, Jennifer Hillman Jackson (United States) wrote:

Hello,

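The custom reference genome (.fasta) needs to have identifiers that exactly match those in the reference annotation (GFF3). The identifier in a .fasta title line is the text after ">" up to the first whitespace, so your file's identifier is currently gi|544163592|ref|NC_007898.3| rather than the NC_007898.3 used in the GFF3. In your case, this means the .fasta dataset should be modified to have an identifier such as: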

NC_007898.3
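
To see which identifiers disagree before editing anything, a quick check outside Galaxy can help. The sketch below is not a Galaxy tool, and the function names are hypothetical. It assumes the FASTA identifier is the text after ">" up to the first whitespace, and that each GFF3 data line starts with the sequence ID.

```python
def fasta_ids(lines):
    """Collect FASTA identifiers: text after '>' up to the first whitespace."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.add(tokens[0] if tokens else "")
    return ids


def gff_seqids(lines):
    """Collect the first column (sequence ID) of every GFF3 data line."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line.startswith("##FASTA"):
            break  # anything after this directive is embedded sequence
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        ids.add(line.split()[0])
    return ids


def missing_from_fasta(fasta_lines, gff_lines):
    """Return GFF3 sequence IDs that have no matching FASTA identifier."""
    return sorted(gff_seqids(gff_lines) - fasta_ids(fasta_lines))


# Sample lines taken from the file heads in the question.
fasta = [
    ">gi|544163592|ref|NC_007898.3| Solanum lycopersicum chloroplast, complete genome",
    "TGGGCGAACGACGGGAATTGAACCCGCGCATGGTGGATTCACAATCCACTGCCTTGATCCACTTGGCTAC",
]
gff = [
    "##gff-version 3",
    "##sequence-region NC_007898.3 1 155461",
    "NC_007898.3\tRefSeq\tgene\t71636\t71749\t.\t-\t.\tID=gene0",
]

print(fasta_ids(fasta))                # {'gi|544163592|ref|NC_007898.3|'}
print(missing_from_fasta(fasta, gff))  # ['NC_007898.3']
```

With real files, the same functions can be called on open file handles instead of lists.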

For genomes with multiple chromosomes, transforming the .fasta identifier/description lines is a bit more involved: convert the file to tabular, convert the delimiters to tabs, convert back to fasta while choosing the correct columns for the identifier and sequence content, and wrap the sequence lines to a standard length (60 is good).
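
If you would rather do the multi-chromosome cleanup outside Galaxy, the same result can be sketched in a few lines of Python. This is only an illustration, not what the Galaxy tools do internally: the function names are hypothetical, and the pattern assumes NCBI-style title lines of the form ">gi|number|db|accession| description", falling back to the first whitespace-delimited token otherwise.

```python
import re

# Assumption: NCBI-style title lines such as
# ">gi|544163592|ref|NC_007898.3| Solanum lycopersicum chloroplast, ..."
NCBI_TITLE = re.compile(r"^>gi\|\d+\|[a-z]+\|([^|\s]+)\|")


def short_id(title_line):
    """Return the accession from an NCBI-style title line, else the first token."""
    match = NCBI_TITLE.match(title_line)
    if match:
        return match.group(1)
    tokens = title_line[1:].split()
    return tokens[0] if tokens else ""


def rewrite_fasta(lines, width=60):
    """Shorten every title line and re-wrap each sequence to `width` columns."""
    out, seq = [], []

    def flush():
        # Emit the sequence collected so far in fixed-width chunks.
        joined = "".join(seq)
        out.extend(joined[i:i + width] for i in range(0, len(joined), width))
        seq.clear()

    for line in lines:
        line = line.strip()
        if not line:
            continue  # drop blank lines
        if line.startswith(">"):
            flush()
            out.append(">" + short_id(line))
        else:
            seq.append(line)
    flush()
    return out
```

For example, `rewrite_fasta(open("genome.fasta"))` (file name hypothetical) returns the cleaned lines, which can then be written out joined with newlines and uploaded as a .fasta dataset.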

For genomes with a single chromosome, there is a quicker route. Remove the first line of the .fasta file (the tool is in the "Text Manipulation" group). Then use "Get Data -> Upload File" to paste in a single-line file that contains just the new identifier line, and load it as a dataset:

>NC_007898.3

No extra spaces, no extra lines.

Then use "Concatenate" to place the new identifier line at the top of the sequence-only dataset that had the prior identifier line removed. Assign as .fasta datatype and all should be good to go.
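
The remove, paste, and concatenate steps above amount to swapping the first line of a single-record file. A minimal sketch of that swap outside Galaxy follows; the function name is hypothetical, and it assumes the file holds exactly one sequence.

```python
def replace_single_header(lines, new_id):
    """Replace the title line of a single-record FASTA with '>' + new_id."""
    if not new_id or any(ch.isspace() for ch in new_id):
        raise ValueError("identifier must be non-empty and contain no whitespace")
    # Drop blank lines and trailing whitespace (no extra spaces, no extra lines).
    records = [line.rstrip() for line in lines if line.strip()]
    if not records or not records[0].startswith(">"):
        raise ValueError("expected a FASTA title line first")
    if sum(1 for line in records if line.startswith(">")) != 1:
        raise ValueError("expected exactly one sequence")
    return [">" + new_id] + records[1:]
```

The returned lines can be written to a new file and uploaded with the datatype set to fasta.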

You may need to convert datatypes as you process; "tabular" format works well with the manipulation tools. Just make certain the final dataset has the .fasta datatype, even if you need to assign it yourself (use the pencil icon for a dataset to reach the "Edit Attributes -> Datatype" metadata modifier). More is in our wiki on the "Support" page under various categories of troubleshooting, should you run into problems along the way. However, this is usually a straightforward replacement.

Hopefully this helps! Jen, Galaxy team

Thanks! It worked. Sorry for the n00b question :)

Reply written 4.4 years ago by cgibas