Hi there, I am new to the field of bioinformatics and have run into some difficulty while using Galaxy for a differential analysis of the common carp transcriptome. Can someone please shed some light on the following matters?
1) I am using an external common carp annotation file downloaded from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Cyprinus_carpio//GFF/ref_common_carp_genome_top_level.gff3). I have run Groomer, TopHat and Cufflinks on my samples, and the results are satisfactory. However, when I try to run Cuffmerge, it fails with an error (GFF Error: duplicate/invalid 'transcript' feature ID=id20522).
I understand that if there is any duplication in the reference file, the program will not run. So I read a couple of instructions from Jan (Galaxy team) and tried to find the duplicated entry, ID=id20522. I opened the reference annotation file in Notepad and searched for repeated hits with that ID. Interestingly, I got several hits. Looking closely, however, I found that they differ in the last digit (for example id205220, id205221), so they are not repeats at all. I also used the Select tool in Galaxy to pull out these apparently repeated entries. I am copying the output here for more details:
ID=id20522;Parent=gene1979;Dbxref=GeneID:109100662;gbkey=C_region;gene=LOC109100662;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 89%25 coverage of the annotated genomic feature by RNAseq alignments NC_031699.1 Gnomon exon 13449894 13449902 . + .
ID=id205220;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon exon 18147332 18147400 . + .
ID=id205221;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon exon 18147474 18147596 . + .
ID=id205222;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon exon 18148344 18148472 . + .
ID=id205223;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon exon 18148553 18148767 . + .
ID=id205224;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon exon 18150225 18150312 . + .
ID=id205225;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon exon 18150387 18150674 . + .
ID=id205226;Parent=rna22760;Dbxref=GeneID:109104191,Genbank:XM_019117564.1;gbkey=mRNA;gene=LOC109104191;product=RAC-beta serine/threonine-protein kinase-like%2C transcript variant X2;transcript_id=XM_019117564.1 NC_031726.1 Gnomon CDS 18144341 18144410 . + 0
I thought I should run Cufflinks again. I did, and with the new results I got a new error: GFF Error: duplicate/invalid 'transcript' feature ID=id17489 [FAILED] Error: could not execute gtf_to_sam
I looked for the repeat again and could not find it. Please let me know whether I have understood the meaning of duplicate/invalid transcripts correctly!
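To make the check concrete: a plain text search for ID=id20522 also matches ID=id205220, ID=id205221 and so on, which is why Notepad reports several hits. What is needed is an exact match on the whole ID value. Below is a minimal Python sketch of such a check, assuming a tab-separated GFF3 file with the ID in the ninth column. The function names and the sample lines (including their coordinates) are made up for illustration and are not part of Galaxy or Cufflinks. Note also that GFF3 itself allows one ID to appear on several lines for a discontinuous feature, so a repeated ID found this way is only a candidate for what Cuffmerge complains about.

```python
import re
from collections import Counter

# Matches the ID attribute at the start of the attribute column or right after a ';'.
# The value runs up to the next ';', so id20522 and id205220 are kept distinct.
ID_PATTERN = re.compile(r"(?:^|;)ID=([^;]+)")


def count_feature_ids(lines):
    """Count how often each exact ID value occurs in GFF3 feature lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed nine-column feature line
        match = ID_PATTERN.search(columns[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def duplicated_ids(lines):
    """Return {ID: count} for every ID that occurs on more than one line."""
    return {fid: n for fid, n in count_feature_ids(lines).items() if n > 1}


# Hypothetical sample lines: only the ID values are taken from the post.
sample = [
    "##gff-version 3\n",
    "chrA\tsrc\texon\t1\t9\t.\t+\t.\tID=id20522;Parent=gene1\n",
    "chrA\tsrc\texon\t20\t30\t.\t+\t.\tID=id205220;Parent=rna1\n",
    "chrA\tsrc\texon\t40\t50\t.\t+\t.\tID=id20522;Parent=gene1\n",
]
print(duplicated_ids(sample))
```

To run it on a real file, the open file handle can be passed in place of the sample list, since the functions only iterate over lines.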
2) Is there a step-by-step protocol for removing repeated annotations? I do not know how to use Python.
3) If my annotation is not good, where can I get a better annotation for common carp (Cyprinus carpio)?
Sorry for the long post. I would be very grateful for any help.
Regards, Raihan