Question: RNA genes count using HTseq
15 months ago by
cjain0 wrote:


I am new to Galaxy, but based on my understanding of how things work, I mapped some RNA-seq data from E. coli K-12 onto a reference genome obtained via UCSC using Bowtie, and then ran that through HTSeq to get a count of reads for the different genes. All of that worked well, but what was missing were the reads corresponding to RNA genes, such as rRNA, tRNA and regulatory RNAs. Is there a way to perform the analysis and get counts for RNA genes as well? This is critical for our analysis, as most of the hits we are expecting from our experiment should correspond to RNA genes.

Also it was not clear which version of the E. coli genome sequence I have downloaded from the UCSC site. There are many K-12 genomes available now, and I'd like to be sure that the version I am using corresponds to the strain we are using (which is actually the first K-12 genome to be sequenced). Is there any way to get information on the K-12 genome (e.g., Genbank or Refseq ID) that is available on the UCSC site?

Other than that, I am really pleased with Galaxy because it has given beginners such as myself the power to manipulate HTS data. Keep the good work going.


Chaitanya Jain Associate Professor, University of Miami

15 months ago by
United States
Jennifer Hillman Jackson24k wrote:

Hello Chaitanya,

How to select specific features from the reference annotation to summarize counts

  • Use an annotation dataset that contains the targeted RNA types. Ensure it is based on exactly the same genome build/source as the reference genome used for the other steps. This includes matching chromosome identifier(s).
  • Adjust the tool form values for the options Feature type and ID Attribute to match the contents of the GFF file for the RNA types of interest. It may take a few runs to capture counts for all of your target RNAs, depending on how the GFF file organizes the feature attributes.
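
To find out which values to enter for Feature type and ID Attribute, it can help to tally what the annotation file actually contains. The following is a minimal Python sketch, not part of Galaxy or HTSeq; the function name `summarize_gff` and the example lines are hypothetical, and it assumes a tab-separated GFF3 or GTF file with nine columns.

```python
from collections import Counter

def summarize_gff(lines):
    """Tally sequence IDs (column 1), feature types (column 3), and the
    attribute keys (column 9) seen for each feature type."""
    seqids = Counter()
    feature_types = Counter()
    attr_keys = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete feature line
        seqids[cols[0]] += 1
        ftype = cols[2]
        feature_types[ftype] += 1
        keys = attr_keys.setdefault(ftype, Counter())
        for field in cols[8].split(";"):
            field = field.strip()
            if not field:
                continue
            # GFF3 attributes look like key=value; GTF attributes look like key "value"
            if "=" in field:
                key = field.split("=", 1)[0]
            else:
                key = field.split(None, 1)[0]
            keys[key] += 1
    return seqids, feature_types, attr_keys

# Hypothetical example lines; for a real file use:
#   with open("annotation.gff") as handle: summarize_gff(handle)
example = [
    "##gff-version 3",
    "chr\t.\tgene\t190\t255\t.\t+\t.\tID=gene1;Name=thrL",
    "chr\t.\ttRNA\t1000\t1076\t.\t+\t.\tID=rna1;gene=ileV",
    "chr\t.\trRNA\t2000\t3541\t.\t+\t.\tID=rna2;gene=rrsH",
]
seqids, feature_types, attr_keys = summarize_gff(example)
for ftype, count in feature_types.items():
    print(ftype, count, sorted(attr_keys[ftype]))
print("sequence IDs:", sorted(seqids))
```

If the RNA genes appear under feature types such as tRNA or rRNA rather than the type used in the first run, that would explain the missing counts. The sequence IDs printed at the end should match the chromosome identifier(s) in the reference genome.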

How to determine the exact source genome build for the E. coli K-12 genome at UCSC (and any other genome hosted at a UCSC genome browser site)

  • Review the home page for the genome browser for this build at UCSC. For your case, examine the home page for the E. coli K-12 build. For others reading, your target genome's browser may be hosted at a different UCSC genome browser site.
  • On the page itself, or near the bottom of it, credits and sources are listed. Some genome home pages include the details directly; others link out to a separate Credits webpage.
  • If the information there is incomplete or not specific enough, contact the team that hosts the site. Contact information is on the same home page. For your case, the information is complete and the NCBI RefSeq Accession is listed as AC_000091.
  • If the information is still incomplete, next check the build on the Downloads page at UCSC. Often more details, including the exact source, are included under one or both of these directories: Full data set or Data set by chromosome. Contact UCSC if any of that info is unclear, not specific enough, or missing (very rare).

Once you obtain the fasta sequence of the build, do a double check against your local fasta of the strain to ensure a match. If it is not a match, you can always use the Custom Reference genome/build functionality with your own fasta from the history with tools in Galaxy. How to:
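
For the double check of the build's fasta against a local fasta described above, one possible approach is to compare sequence lengths and checksums. This is a minimal Python sketch under stated assumptions: the function names `fasta_digests` and `same_sequences` are hypothetical, both files are plain (uncompressed) fasta, and MD5 is used only as a fingerprint of the sequence content.

```python
import hashlib

def fasta_digests(lines):
    """Return {sequence_name: (length, md5 of the uppercased sequence)}."""
    digests = {}
    name, length, md5 = None, 0, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                digests[name] = (length, md5.hexdigest())
            parts = line[1:].split()
            name = parts[0] if parts else ""  # first word of the header line
            length, md5 = 0, hashlib.md5()
        elif name is not None:
            seq = line.upper()
            length += len(seq)
            md5.update(seq.encode("ascii"))
    if name is not None:
        digests[name] = (length, md5.hexdigest())
    return digests

def same_sequences(fasta_a, fasta_b):
    """Compare sequence content only, ignoring names, line wrapping, and case."""
    a = sorted(fasta_digests(fasta_a).values())
    b = sorted(fasta_digests(fasta_b).values())
    return a == b

# Hypothetical toy records; for real files pass open file handles instead.
ucsc = [">chr", "ACGT", "acgt"]
local = [">my_strain some description", "ACGTACGT"]
print(same_sequences(ucsc, local))
print(fasta_digests(ucsc).keys(), fasta_digests(local).keys())
```

Identical sequences can still carry different names, as in the toy records here. In that case the chromosome identifiers in the annotation must be matched to whichever fasta is used for mapping.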

Hopefully this helps! Jen, Galaxy team
