RNA genes count using HTseq

2.0 years ago by

cjain • 10 wrote:

Hello:

I am new to Galaxy, but based on my understanding of how things work, I mapped some RNA-seq data from E. coli K-12 onto a reference genome obtained via UCSC using Bowtie, and then ran the result through HTSeq to get a count of reads for the different genes. All of that worked well, but the output was missing the reads corresponding to RNA genes, such as rRNA, tRNA and regulatory RNAs. Is there a way to perform the analysis and get counts for RNA genes as well? This is critical for our analysis, as most of the hits we are expecting from our experiment should correspond to RNA genes.

Also it was not clear which version of the E. coli genome sequence I have downloaded from the UCSC site. There are many K-12 genomes available now, and I'd like to be sure that the version I am using corresponds to the strain we are using (which is actually the first K-12 genome to be sequenced). Is there any way to get information on the K-12 genome (e.g., Genbank or Refseq ID) that is available on the UCSC site?

Other than that, I am really pleased with Galaxy because it has given beginners such as myself the power to manipulate HTS data. Keep the good work going.

Sincerely,

Chaitanya Jain Associate Professor, University of Miami

reference-genome reference-annotation htseq_count • 764 views


modified 2.0 years ago by Jennifer Hillman Jackson ♦ 25k • written 2.0 years ago by cjain • 10

2.0 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello Chaitanya,

How to select specific features from the reference annotation to summarize counts

Use an annotation dataset that contains the targeted RNA types. Ensure it is based on exactly the same genome build/source as the reference genome used for the other steps. This includes matching chromosome identifier(s).
Adjust the tool form values for the options Feature type and ID Attribute to match the contents of the GFF file for the RNA types of interest. It may take a few runs to capture counts for all of your target RNAs, depending on how the GFF file organizes the feature attributes.
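
To decide what to enter for Feature type and ID Attribute, it can help to first list which feature types (column 3) and attribute keys (column 9) the annotation file actually contains, along with its chromosome identifiers (column 1). The following is a minimal Python sketch of that check, not part of Galaxy or HTSeq. The function name summarize_gff and the example records are hypothetical and only for illustration. On the htseq-count command line, the two form options correspond to --type and --idattr.

```python
from collections import Counter, defaultdict


def summarize_gff(lines):
    """Tally feature types, the attribute keys seen per type, and sequence IDs."""
    type_counts = Counter()
    attr_keys = defaultdict(set)
    seqids = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF/GTF record
        seqids.add(cols[0])
        ftype = cols[2]
        type_counts[ftype] += 1
        for field in cols[8].split(";"):
            field = field.strip()
            if not field:
                continue
            # GFF3 writes key=value; GTF writes key "value"
            if "=" in field:
                key = field.split("=", 1)[0]
            else:
                key = field.split(None, 1)[0]
            attr_keys[ftype].add(key.strip())
    return type_counts, attr_keys, seqids


# Hypothetical records, for illustration only (not real E. coli annotation)
example = [
    "##gff-version 3",
    "chr\tsource\tgene\t1\t100\t.\t+\t.\tID=gene0;Name=geneA",
    "chr\tsource\ttRNA\t200\t276\t.\t+\t.\tID=rna0;gene=trnX;product=tRNA-X",
    "chr\tsource\trRNA\t300\t1841\t.\t+\t.\tID=rna1;gene=rrnY;product=rRNA-Y",
]
counts, keys, seqids = summarize_gff(example)
for ftype, n in sorted(counts.items()):
    print(ftype, n, sorted(keys[ftype]))
print("sequence identifiers:", sorted(seqids))
```

If the RNA genes appear under their own feature types (for example tRNA and rRNA rather than exon or gene), one counting run per feature type, each with an attribute key present on that type, would be a reasonable starting point. The printed sequence identifiers should match the chromosome names in the reference genome used for mapping.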

How to determine the exact source genome build for the E. coli K-12 genome at UCSC (and any other genome hosted at a UCSC genome browser site)

Review the home page of the genome browser for this build at UCSC. For your case, examine http://archaea.ucsc.edu/. For others reading, your target genome's browser may be hosted at http://genome.ucsc.edu instead.
Credits and sources are listed on the page itself, usually near the bottom. Some genome home pages include the details directly; others link out to a separate Credits webpage.
If the information at http://archaea.ucsc.edu/ is incomplete or not specific enough, contact the team that hosts the site. Contact information is on the same home page. For your case, the information is complete and the NCBI RefSeq Accession is listed as AC_000091 (http://archaea.ucsc.edu/cgi-bin/hgGateway?db=eschColi_K_12_SUBSTR_W3110).
If the information is incomplete from http://genome.ucsc.edu, next check the build on the Downloads page at UCSC. Often more details, including the exact source, are included under one or both of these directories: Full data set or Data set by chromosome. Contact UCSC if any of that info is unclear, not specific enough, or missing (very rare).

Once you obtain the fasta sequence of the build, double-check it against your local fasta of the strain to ensure a match. If it is not a match, you can always use the Custom Reference genome/build functionality in Galaxy, which lets tools use your own fasta from the history. How to: https://wiki.galaxyproject.org/Support#Custom_reference_genome
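
One way to do that double check outside of Galaxy is to compare the two fasta files by sequence content rather than by name, since the title lines and line wrapping often differ between sources. The following is a minimal Python sketch under that assumption. The function names fasta_checksums and same_sequences, and the example records, are hypothetical. It loads each sequence into memory, which is fine for a bacterial genome.

```python
import hashlib


def fasta_checksums(lines):
    """Return {sequence name: (length, MD5 of the uppercased sequence)}."""
    seqs = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # first word of the title line
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line.upper())  # ignore soft-masking (lowercase)
    result = {}
    for seq_name, parts in seqs.items():
        seq = "".join(parts)
        result[seq_name] = (len(seq), hashlib.md5(seq.encode("ascii")).hexdigest())
    return result


def same_sequences(a, b):
    """Compare by content only, ignoring sequence names and line wrapping."""
    return sorted(a.values()) == sorted(b.values())


# Hypothetical toy records; real input would be lines read from the two fasta files
ucsc = [">chr", "ACGT", "acgt"]
local = [">my_strain some description", "ACGTACGT"]
print(same_sequences(fasta_checksums(ucsc), fasta_checksums(local)))
```

With real data, each list would be replaced by an open file handle for the downloaded build and for the local strain fasta. Matching lengths and checksums indicate identical sequences. A mismatch in length alone is already enough to show the builds differ.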

Hopefully this helps! Jen, Galaxy team

