Running Cufflinks On Bacterial Rnaseq Data



I am attempting to run Cufflinks on Galaxy main to analyze my E. coli RNA-seq data. I mapped my reads using an outside program (Geneious) and uploaded the resulting BAM file. I also uploaded the E. coli annotations as a GTF file. However, when I attempt to run Cufflinks using my annotations, the job just stays on "Job is waiting to run" for hours. If I click on "Edit attributes", I see the error message "Required metadata values are missing".

Does this mean that my files are somehow incomplete and Cufflinks will never run, or do I just need to wait longer? When searching the mailing lists, I saw that others have had issues with bacteria because of their circular chromosomes, and I was wondering whether this might be related.

Thanks.
Rachel

rna-seq cufflinks • 1.8k views


modified 6.3 years ago by Jennifer Hillman Jackson ♦ 25k • written 6.3 years ago by Rachel Krasich • 10


Jennifer Hillman Jackson ♦ 25k wrote:

Hello Rachel,

When datasets are in a grey "waiting to run" state, they are in the queue and in line to run. In the majority of cases, including yours, leaving the job alone and allowing it to run is the correct option. The missing metadata only means that the result has not yet posted to your history (expected while the dataset is still grey).

It looks as if your jobs have now run, but resulted in errors. The problem is with the input GFF3 dataset. It contains at least one duplicated "ID" attribute, and IDs are required to be unique within GFF3 files. Clicking on the green bug icon in any of the red error datasets will point to the example duplicated ID. To my knowledge, the content being based on a bacterial genome is unrelated to this format problem.

For reference, this is the specification help for GFF3: http://wiki.g2.bx.psu.edu/Learn/Datatypes#GFF3

This can be a difficult problem to resolve on your own, since the full scope of the file's issues is unknown. Locating an alternate source, or contacting the original source of this GFF3 dataset to request a correction, would be potential solutions. The tophat.cufflinks@gmail.com mailing list or seqanswers.com are suggested places to ask for reference annotation file recommendations.

Best,
Jen
Galaxy team

--
Jennifer Jackson
http://galaxyproject.org
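To get a sense of how widespread the repeated IDs are before contacting the annotation source, a small script can list every ID that appears on more than one line. The following is only a minimal sketch, not what Galaxy or Cufflinks runs internally: the function names `parse_attributes` and `repeated_ids` and the example records are hypothetical, and it assumes a tab-separated, nine-column GFF3 file with `key=value` pairs in column 9.

```python
from collections import Counter
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse a GFF3 attributes column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 percent-encodes reserved characters inside values.
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def repeated_ids(lines):
    """Return {ID: line count} for every ID found on more than one feature line."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence, not annotation
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        feature_id = parse_attributes(columns[8]).get("ID")
        if feature_id is not None:
            counts[feature_id] += 1
    return {fid: n for fid, n in counts.items() if n > 1}


# Made-up records for illustration only (not taken from the poster's file).
example = [
    "##gff-version 3\n",
    "chr\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene0001;Name=thrL\n",
    "chr\tsrc\tgene\t200\t300\t.\t+\t.\tID=gene0002;Name=thrA\n",
    "chr\tsrc\tgene\t400\t500\t.\t-\t.\tID=gene0001;Name=thrB\n",
]
print(repeated_ids(example))  # {'gene0001': 2}
```

For a real file, the same function can be fed an open file handle, for example `repeated_ids(open("annotations.gff3"))`, where the file name is a placeholder.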

written 6.3 years ago by Jennifer Hillman Jackson ♦ 25k

Peter is correct (I oversimplified)! Cufflinks does allow an ID attribute to span multiple lines, as long as those lines represent the same feature. To be clear, the error here was a true format issue.

The best way to understand the finer points is to read the specification (also linked from the wiki below): http://www.sequenceontology.org/gff3.shtml

(quote)
Column 9: "attributes" <...>
ID: Indicates the ID of the feature. IDs for each feature must be unique within the scope of the GFF file. In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines. All lines that share an ID collectively represent a single feature.

Thanks, Peter, for the clarification!

Jen
Galaxy team

--
Jennifer Jackson
http://galaxyproject.org
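Because the specification allows one ID on several lines for a discontinuous feature, a repeated ID is not automatically an error. One rough way to triage repeats is to check whether all lines sharing an ID also share the same feature type (column 3). This is a heuristic I am assuming for illustration, not a rule from the specification and not how Cufflinks validates its input; the function names and example records are hypothetical.

```python
from collections import defaultdict


def feature_types_by_id(lines):
    """Map each ID attribute to the list of feature types (column 3) using it."""
    types = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        for field in columns[8].split(";"):
            field = field.strip()
            if field.startswith("ID="):
                types[field[3:]].append(columns[2])
    return types


def classify_repeated_ids(lines):
    """Split repeated IDs into (plausibly one feature, likely conflicting).

    Heuristic: lines that share an ID and a feature type could be one
    discontinuous feature; lines that share an ID but differ in type
    probably describe different features.
    """
    plausible, suspect = [], []
    for feature_id, types in feature_types_by_id(lines).items():
        if len(types) < 2:
            continue  # ID used once, nothing to check
        if len(set(types)) == 1:
            plausible.append(feature_id)
        else:
            suspect.append(feature_id)
    return plausible, suspect


# Made-up records for illustration only.
example = [
    "chr\tsrc\tCDS\t1\t90\t.\t+\t0\tID=cds1;Parent=mrna1\n",
    "chr\tsrc\tCDS\t150\t240\t.\t+\t0\tID=cds1;Parent=mrna1\n",
    "chr\tsrc\tgene\t300\t400\t.\t-\t.\tID=id7\n",
    "chr\tsrc\ttRNA\t500\t576\t.\t+\t.\tID=id7\n",
    "chr\tsrc\tgene\t600\t700\t.\t+\t.\tID=id8\n",
]
print(classify_repeated_ids(example))  # (['cds1'], ['id7'])
```

IDs in the second list are the ones worth raising with the annotation source. IDs in the first list may still be wrong (for example, two separate genes given the same ID), so they deserve a manual look as well.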

written 6.3 years ago by Jennifer Hillman Jackson ♦ 25k

Actually that isn't quite right (although it may be a limitation imposed by some tools using GFF3 as an input). Features split over multiple locations are described in GFF3 using multiple lines sharing the same ID attribute. This is most commonly used for genes made up of multiple exons, but can even apply across references in some extreme trans-splicing cases. Peter

written 6.3 years ago by Peter Cock • 1.4k
