HT-seq count does not recognize "gff" annotation dataset == assign GTF and consider an alternate annotation source

5 months ago by

drumarsohail • 10 wrote:

Hi all

I have been following a Galaxy tutorial for RNA-Seq of Drosophila melanogaster.

I have done bwa mem, tophat2, and filter and sort steps.

I have imported the SFT file from UCSC and filtered it using the condition c7 != "."
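
For readers who want to see what the c7 != "." condition does outside Galaxy, here is a minimal Python sketch. It assumes a tab-separated annotation file in which column 7 holds the strand, as in GFF/GTF. The function name `filter_rows` and the sample rows are made up for illustration, and skipping comment lines is an assumption, not a statement about how the Galaxy Filter tool is implemented.

```python
def filter_rows(lines, column=7, reject="."):
    """Keep tab-separated rows whose 1-based `column` differs from `reject`."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines and '#' comment lines are skipped.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Rows that are too short to have the column are dropped as well.
        if len(fields) >= column and fields[column - 1] != reject:
            kept.append(line)
    return kept


# Hypothetical sample rows: the second one has "." in column 7 (strand).
sample_rows = [
    "\t".join(["chr2L", "example", "exon", "100", "200", ".", "+", ".",
               'gene_id "g1"; transcript_id "t1";']),
    "\t".join(["chr2L", "example", "exon", "300", "400", ".", ".", ".",
               'gene_id "g2"; transcript_id "t2";']),
]

kept = filter_rows(sample_rows)
print(len(kept), "row(s) kept")
```

Only rows with a defined strand survive the filter. Note that filtering does not change the file format or the datatype Galaxy assigns to the result.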

Now I want to run htseq-count with the tophat accepted_hits as the aligned reads and the "Filter on data" file as the annotation.

But htseq-count doesn't offer the "Filter on data" file in the GFF file option.

See image below

https://usegalaxy.org/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Flparsons%2Fhtseq_count%2Fhtseq_count%2F0.9.1&version=0.9.1&__identifer=n743qtki6g8

gff annotation gtf galaxy htseq • 275 views

5 months ago by

Jennifer Hillman Jackson ♦ 25k (United States) wrote:

Hi,

I renamed the subject to reflect the question better (to link it to prior Q&A).

If the annotation comes from the tool Get Data > UCSC Main with the Table Browser output format set to GTF, then that annotation will technically work. Just be aware that counts will be summarized by transcript, not by gene. Why? Data extracted from UCSC this way has gene_id and transcript_id set to the same value, the transcript identifier. This is how the UCSC Table Browser works.
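
A quick way to see whether an annotation has this property is to compare the two attributes on each line. The following is a minimal Python sketch, assuming standard nine-column GTF lines with attributes written as key "value"; pairs. The function names and the identifiers in the sample lines are hypothetical.

```python
import re

# Matches attribute pairs such as: gene_id "abc";
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    return dict(ATTRIBUTE.findall(field))


def ids_are_identical(gtf_lines):
    """Return True if every usable line has gene_id equal to transcript_id."""
    seen_any = False
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = parse_attributes(fields[8])
        if "gene_id" not in attrs or "transcript_id" not in attrs:
            continue
        seen_any = True
        if attrs["gene_id"] != attrs["transcript_id"]:
            return False
    # False when no line carried both attributes at all.
    return seen_any


# Hypothetical lines: the first mimics a Table Browser export, the second
# a source where gene and transcript identifiers differ.
transcript_level = ["\t".join(["chr2L", "example", "exon", "100", "200", ".", "+", ".",
                               'gene_id "NM_0001"; transcript_id "NM_0001";'])]
gene_level = ["\t".join(["chr2L", "example", "exon", "100", "200", ".", "+", ".",
                         'gene_id "geneA"; transcript_id "geneA-RA";'])]

print(ids_are_identical(transcript_level))
print(ids_are_identical(gene_level))
```

If the check returns True for your file, counts grouped by gene_id will effectively be per-transcript counts.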

The datatype gff would not be assigned to data imported directly into Galaxy from the UCSC Table Browser with this method.

Perhaps the data was loaded using another method? You can either accept transcript-level counts as a usable summary and reassign the datatype to gtf (if the file matches that datatype specification and the genome build/version used for mapping), or choose another annotation source in which gene_id identifies genes (strongly recommended).

Wherever you obtain the annotation, check it against the genome used for mapping. You may also need to assign the correct database to the BAM inputs because of a small bug that was fixed last week.
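
One part of that check is whether the sequence names in the annotation match those of the reference used for mapping (for example "chr2L" versus "2L"). Below is a minimal Python sketch of the comparison. It assumes the reference names have already been copied out of the BAM header (the SN values of the @SQ lines) into a plain list; the function names and sample data are hypothetical.

```python
def annotation_seqnames(gtf_lines):
    """Collect the distinct values of column 1 from GTF/GFF lines."""
    names = set()
    for line in gtf_lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def compare_seqnames(annotation_names, reference_names):
    """Report names shared with the reference and names only in the annotation."""
    annotation_names = set(annotation_names)
    reference_names = set(reference_names)
    return {
        "shared": sorted(annotation_names & reference_names),
        "annotation_only": sorted(annotation_names - reference_names),
    }


# Hypothetical data: annotation without the "chr" prefix, reference with it.
gtf_lines = ["\t".join(["2L", "example", "exon", "100", "200", ".", "+", ".",
                        'gene_id "geneA"; transcript_id "geneA-RA";'])]
bam_reference_names = ["chr2L", "chr2R"]

report = compare_seqnames(annotation_seqnames(gtf_lines), bam_reference_names)
print(report)
```

An empty "shared" list means no feature can overlap any read, so every count would come out as zero. Matching names alone do not prove that the builds match, since coordinates can differ between genome versions.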

The FAQs here explain in more detail: https://galaxyproject.org/support/#troubleshooting. You can also review prior Q&A in the right sidebar (or search for the term "htseq") to see how others have resolved annotation/database/datatype conflicts within the context of their RNA-seq analysis.

Tophat is considered deprecated, with HISAT2 as the replacement. Please see the Galaxy RNA-seq tutorials for example workflows: https://galaxyproject.org/learn/

Thanks! Jen, Galaxy team

5 months ago by

drumarsohail • 10 wrote:

Hi Jen

Thanks, it helped. I hadn't set the UCSC output file format to GTF, so Galaxy couldn't recognize it. It was a really nicely written tutorial. Thanks again.
