Question: GFF Error: duplicate/invalid 'transcript'
7 months ago, bioraihan wrote:

Hi there, I am new to bioinformatics and have run into some difficulty using Galaxy for a differential expression analysis of the common carp transcriptome. Can someone please shed some light on the following matters?

1) I am using an external common carp annotation file downloaded from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Cyprinus_carpio//GFF/ref_common_carp_genome_top_level.gff3). I have run Groomer, TopHat, and Cufflinks on my samples, and the results are satisfactory. However, when I try to run Cuffmerge, it fails with an error (GFF Error: duplicate/invalid 'transcript' feature ID=id20522).

I understand that if there is any duplication in the reference file, the program will not run. So I read a couple of instructions from Jen (Galaxy team) and tried to find the duplicated record, which is ID=id20522. I opened the reference annotation file in Notepad and searched for repeated hits with that ID. Interestingly, I got several hits. Looking closely, I found that the last digit differs in each one, so they are not repeats at all. I also used the Select tool in Galaxy to pull out these apparently repeated records. I am copying the lines here for more detail:

ID=id20522;Parent=gene1979;Dbxref=GeneID:109100662;gbkey=C_region;gene=LOC109100662;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 89%25 coverage of the annotated genomic feature by RNAseq alignments NC_031699.1 Gnomon exon 13449894 13449902 . + .
ID=id205220;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon exon 18147332 18147400 . + .
ID=id205221;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon exon 18147474 18147596 . + .
ID=id205222;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon exon 18148344 18148472 . + .
ID=id205223;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon exon 18148553 18148767 . + .
ID=id205224;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon exon 18150225 18150312 . + .
ID=id205225;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon exon 18150387 18150674 . + .
ID=id205226;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon CDS 18144341 18144410 . + 0

I thought I should run Cufflinks again. I did, and with the new result I got a new error: GFF Error: duplicate/invalid 'transcript' feature ID=id17489 [FAILED] Error: could not execute gtf_to_sam

I looked for the repeat again and couldn't find it. Please let me know whether I have understood the true meaning of repeated/invalid transcripts!

2) Is there a step-by-step protocol for removing repeated annotations? I have no experience with Python.

3) If my annotation is not good, where can I get a better annotation for common carp (Cyprinus carpio)?

Sorry for the long post. I would be very grateful for any help.

Regards, Raihan

7 months ago, Jennifer Hillman Jackson (United States) wrote:

Hello Raihan,

The first duplicate ID reported was "ID=id20522", but the query used with the Select tool is bringing up more matches. Maybe your regular expression query with the Select tool needs to be more specific?

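Example: Using the regular expression ^ID=id20522; with the Select tool and the "Not matching" option would remove all lines with just that exact ID. Note the starting ^ anchor and the trailing ; in the regex. Together they keep the query from being too greedy: without the trailing ; the pattern also matches longer IDs that merely begin with the same digits, such as id205220.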
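For anyone checking this outside Galaxy, here is a minimal Python sketch of the same idea. It is only an illustration, not what the Select tool runs internally, and the three sample lines are shortened, hypothetical versions of the records pasted in the question. It assumes, as in that excerpt, that the attribute column comes first on each line.

```python
import re

# Shortened sample lines (hypothetical); the attribute column is assumed to come first.
lines = [
    "ID=id20522;Parent=gene1979;gbkey=C_region",
    "ID=id205220;Parent=rna22760;gbkey=mRNA",
    "ID=id205221;Parent=rna22760;gbkey=mRNA",
]

# Unanchored pattern: matches any line containing the text "ID=id20522",
# so it also hits id205220 and id205221.
loose = re.compile(r"ID=id20522")

# Anchored pattern: must start the line, and the ID must end at the ';'.
strict = re.compile(r"^ID=id20522;")

loose_hits = [ln for ln in lines if loose.search(ln)]
strict_hits = [ln for ln in lines if strict.search(ln)]

print(len(loose_hits), "lines match the loose pattern")
print(len(strict_hits), "line(s) match the strict pattern")
```

The loose pattern is what a plain text search in an editor does, which is why several "hits" with different trailing digits show up for a single ID.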

Reverse the query using the option "Matching" first to review all lines with this ID. This is good to test the regex. Also, one of the result lines from this query's result dataset could be added back to the "Not matching" dataset if you wanted. Tools to subset lines from datasets are in the tool group Text Manipulation (Select first line(s), Select last line(s), etc). Use the Concatenate tool to merge two or more datasets head to tail.
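The "Matching" / "Not matching" / Concatenate sequence can be sketched in Python as below. This is a rough equivalent under stated assumptions, not the Galaxy implementation: the function names split_by_id and keep_one are made up for this example, and the input is assumed to be a list of lines that begin with the ID attribute.

```python
import re

def split_by_id(lines, feature_id):
    """Return (matching, not_matching) for lines that start with ID=<feature_id>;"""
    pattern = re.compile(r"^ID=" + re.escape(feature_id) + r";")
    matching = [ln for ln in lines if pattern.search(ln)]
    not_matching = [ln for ln in lines if not pattern.search(ln)]
    return matching, not_matching

def keep_one(lines, feature_id):
    """Drop every line with this ID, then append the first matching line back."""
    matching, not_matching = split_by_id(lines, feature_id)
    # Mirrors Concatenate: the kept line is added at the tail of the dataset.
    return not_matching + matching[:1]

# Hypothetical sample with one truly duplicated ID (idA) and one look-alike (idAB).
sample = ["ID=idA;note=first", "ID=idA;note=second", "ID=idAB;note=other"]
cleaned = keep_one(sample, "idA")
print(cleaned)
```

Note that the kept line ends up at the end of the result, just as it would after Concatenate, so the file may need re-sorting if a downstream tool expects a particular order.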

It seems that you have more than one duplicated ID in the dataset. To find all, isolate just the ID from the entire dataset and count up the number for each unique ID. Any with two or more occurrences will be problematic and will need the same manipulation described above (first remove all, then optionally add one per ID back in). Tools to use are:

  1. Convert delimiters to TAB with ; as the delimiter
  2. Group on the first column (c1) with the "Count" operation, so that each unique ID is reported with its number of occurrences.
  3. Filter to find the IDs in the dataset present two or more times with the expression: c2!='1'
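The three steps above amount to "take the first ;-separated field, count it, keep counts of two or more". A minimal Python sketch of that logic follows, for readers who prefer a script. It assumes lines that begin with the ID attribute (as in the excerpt in the question); the function name duplicated_ids and the sample data are hypothetical.

```python
from collections import Counter

def duplicated_ids(lines):
    """Map each ID field that occurs two or more times to its count."""
    # Step 1: split on ';' and keep only the first field (the ID=... part).
    ids = [ln.split(";", 1)[0] for ln in lines if ln.startswith("ID=")]
    # Step 2: count occurrences of each unique ID.
    counts = Counter(ids)
    # Step 3: keep only IDs present two or more times.
    return {feature_id: n for feature_id, n in counts.items() if n >= 2}

# Hypothetical sample: id1 is duplicated, id10 only looks similar.
sample_lines = [
    "ID=id1;Parent=rna1",
    "ID=id10;Parent=rna2",
    "ID=id1;Parent=rna3",
]
print(duplicated_ids(sample_lines))
```

Because the whole first field is compared, id1 and id10 are counted as different IDs, which is the behavior wanted here.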

This could be quite a bit of manipulation to do and there will be content loss, but it is an option. I do not know of an alternative GFF3 source for this genome, but perhaps someone else from the community does and will add in another reply or comment.

Hope this helps! Jen, Galaxy team

7 months ago, bioraihan wrote:

Hi Jen, thanks a lot for your prompt reply. As I said, I am new to this field and have never used the tools mentioned above; however, I am looking through the manuals and trying to learn.

I still have one question. As I said, although the leading digits of the IDs are the same, the last digit is different, and that should make them distinct from each other, so Cuffmerge should not count them as the same! Did I miss something?

Thanks for your help and time.

Regards, Raihan
