Cufflinks duplicate GFF ID error - best course of action?

4.3 years ago by

Canada

Suzanne Gomes • 120 wrote:

Hi, I have a question regarding the best way to handle genome annotations that include duplicate GFF IDs. I originally tried running my data and encountered this problem with the D. pseudoobscura FlyBase annotation. I realized the annotation contained a lot of features from different sources, so I filtered it to include only lines with 'FlyBase' and 'gene'. This ran just fine through the rest of the Tuxedo pipeline. However, I recently realized that this results in an annotation with only the whole genes, but no intron/exon structure. Adding the intron/exon lines back into the annotation produces the duplicate GFF ID error.
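The filtering step described above can be sketched in Python. This is a minimal illustration, not the command actually used in the thread: the function name `filter_gff` and the set of feature types kept (`gene`, `mRNA`, `exon`) are assumptions chosen to show how transcript structure could be retained alongside the gene rows. It relies only on the standard GFF layout of nine tab-separated columns, with the source in column 2 and the feature type in column 3.

```python
def filter_gff(lines, source="FlyBase", feature_types=("gene", "mRNA", "exon")):
    """Yield GFF lines whose source (column 2) and type (column 3) match.

    The default feature types are an assumption for illustration; adjust
    them to whatever the downstream tool needs.
    """
    for line in lines:
        if line.startswith("##FASTA"):
            break  # stop before any embedded sequence section
        if line.startswith("#"):
            yield line  # keep header and comment lines unchanged
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip blank or malformed lines
        if cols[1] == source and cols[2] in feature_types:
            yield line


# Hypothetical example rows, not taken from a real FlyBase file.
example = [
    "##gff-version 3\n",
    "2\tFlyBase\tgene\t1\t100\t.\t+\t.\tID=g1\n",
    "2\tFlyBase\texon\t1\t50\t.\t+\t.\tID=e1;Parent=t1\n",
    "2\tblastx\tmatch\t1\t50\t.\t+\t.\tID=m1\n",
]
for kept_line in filter_gff(example):
    print(kept_line, end="")
```

Filtering this way keeps exon rows, so it does not by itself avoid the duplicate ID error; it only narrows the file to one source.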

So my question is - is it worth re-running with the exon/intron structure added back in somehow? It sounds like it is possible to work around the GFF ID error, as mentioned here (Cufflinks error when trying to align against genome), though I'm not sure how hard that would be to do. And if I did, should I still include the whole genes, or only the introns/exons (or only the exons)?

Looking at my data in Trackster, it seems like Cufflinks has done a pretty good job of finding exon/intron boundaries (that match well with those on FlyBase) all on its own. But it sometimes has the same gene listed under two different gene IDs in the output (with non-zero FPKMs for both) - one which is the gene with introns, one without (spanning the whole gene). So if it's splitting the reads between two different entries, I worry that this might affect my ability to call differential expression in some cases (especially for genes that already have few reads mapping).

Anyone else had this problem? If so, how did you address it?

Thanks

Suzanne

rna-seq tool flybase cufflinks inputs • 2.4k views

modified 3.8 years ago • written 4.3 years ago by Suzanne Gomes • 120

4.3 years ago by

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Suzanne,

Glad you were able to locate the original reply - those general instructions are still the best path when encountering duplicated GFF IDs, especially for genomes with iGenomes annotation. Unfortunately, D. pseudoobscura does not have one. Both gene-level and transcript-level reference annotations provide benefits, but if an annotation is not available in the proper format for the tool, it cannot be used as-is.

If I understood the explanation from the FlyBase team correctly, what appear to be duplicated genes are in fact valid and distinct genomic features. That this causes a conflict with the Tuxedo suite is known to them, but duplicating GFF IDs is the proper way to model the data scientifically. As far as I know, reads splitting between features labeled under the same GFF ID is expected.

You could modify the GFF IDs to allow the tools to accept the annotation file in full, but I still think that the FlyBase team would be the best direct source for recommendations. That said, others are still welcome to post comments/experiences! And if you wish to post any feedback you receive from them, and your solution, that would almost certainly aid others.
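As a rough idea of what modifying the GFF IDs could look like, here is a minimal Python sketch. It is an assumption-laden illustration, not a procedure recommended by FlyBase or the Galaxy team: the function name `uniquify_ids` and the `.dupN` suffix scheme are hypothetical. It renames the second and later occurrences of any `ID=` value so each feature row carries a distinct ID. Note that `Parent=` attributes are left untouched, so any children of a renamed feature would still point at the original ID, and rows that legitimately share an ID would be separated.

```python
import re
from collections import Counter

# Matches the value of the ID attribute in GFF column 9.
ID_RE = re.compile(r"(?:^|;)ID=([^;]+)")


def uniquify_ids(lines):
    """Return GFF lines with repeated ID values given a '.dupN' suffix.

    The suffix scheme is hypothetical; Parent references are not rewritten.
    """
    seen = Counter()
    out = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            out.append(line)  # pass comments and non-feature lines through
            continue
        match = ID_RE.search(cols[8])
        if match:
            gff_id = match.group(1)
            seen[gff_id] += 1
            if seen[gff_id] > 1:
                new_id = f"{gff_id}.dup{seen[gff_id] - 1}"
                cols[8] = cols[8][:match.start(1)] + new_id + cols[8][match.end(1):]
        out.append("\t".join(cols) + "\n")
    return out


# Hypothetical rows sharing one ID.
example = [
    "2\tFlyBase\texon\t1\t10\t.\t+\t.\tID=e1;Parent=t1\n",
    "2\tFlyBase\texon\t20\t30\t.\t+\t.\tID=e1;Parent=t1\n",
]
print("".join(uniquify_ids(example)), end="")
```

Checking a handful of renamed features against the original file before running the full pipeline would be a sensible precaution.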

Best, Jen, Galaxy team

4.1 years ago by

jogoodma • 10 wrote:

Hi Suzanne,

FlyBase started distributing GTF-formatted files of our genome data in July for our FB2014_04 release. This format tends to work much better with Cufflinks. Can you give that a try and let us know?

ftp://ftp.flybase.org/genomes/dpse/current/gtf
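Before running a downloaded GTF through the pipeline, a quick sanity check can confirm that it carries exon-level structure. The sketch below is an illustration under stated assumptions: the function name `summarize_gtf` is hypothetical, and it only relies on the general GTF convention of nine tab-separated columns with `gene_id` and `transcript_id` attributes in column 9. It counts feature types and reports how many exon rows lack either attribute.

```python
import re
from collections import Counter

# GTF attributes look like: gene_id "X"; transcript_id "Y";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def summarize_gtf(lines):
    """Return (feature type counts, number of exon rows missing IDs)."""
    feature_counts = Counter()
    exons_missing_ids = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed rows
        feature_counts[cols[2]] += 1
        if cols[2] == "exon":
            attrs = dict(ATTR_RE.findall(cols[8]))
            if "gene_id" not in attrs or "transcript_id" not in attrs:
                exons_missing_ids += 1
    return feature_counts, exons_missing_ids


# Hypothetical rows; identifiers are made up for illustration.
example = [
    '2\tFlyBase\texon\t1\t10\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
    '2\tFlyBase\texon\t20\t30\t.\t+\t.\tgene_id "g1";\n',
]
counts, missing = summarize_gtf(example)
print(dict(counts), missing)
```

A file with no exon rows at all, or with many exon rows missing IDs, would suggest the annotation still lacks the transcript structure discussed in this thread.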

Cheers,

Josh

FlyBase

Thanks Josh! I will add this information to our RNA-seq wiki help. Jen, Galaxy team

written 4.1 years ago by Jennifer Hillman Jackson ♦ 25k

3.8 years ago by

Canada

Suzanne Gomes • 120 wrote:

Hi Josh,

I tried your link above, but I just get a 'this webpage is not available' error. I also looked on the website, but all I can find is the GFF formatted files. I was also wondering whether you guys have a GTF version of the D. willistoni annotation available?

Thanks

Suzanne
