All reads are 0 in htseq-count

Question: All reads are 0 in htseq-count

2.2 years ago by

Hello,

I have a trouble using htseq-count, I used TopHat to aling my data and it looks ok but when I tried to quantify the number of reads all the genes present 0 read.

You know somethign that I could change to obtain the number of reads?

Thanks!!

Juan.

troubleshooting genome custom-genomes htseq-count • 1.7k views

ADD COMMENT • link •

modified 2.2 years ago • written 2.2 years ago by juan.perianez • 20

Do the chromosome names in your GTF and BAM file match? That's the most frequent cause of this.

ADD REPLY • link written 2.2 years ago by Devon Ryan • 1.9k

Agree with Devon. Try running BAM-to-SAM on the aligned data outputting just the SAM header and compare the chromosome identifiers between all inputs. More: https://wiki.galaxyproject.org/Support#Reference_genomes

Jen, Galaxy team

ADD REPLY • link written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

Hi,

Thanks for your help, I do not know really how to check this, It is the first time that I use Galaxy. I took the GTF file from the Arabidopsis_thaliana_Ensembl_TAIR10.tar and it present the format that I show below:

Seqname Source Feature Start End Score Strand Frame Attributes 1 ensembl UTR 3631 3759 . + . gene_biotype "protein_coding"; gene_id "AT1G01010"; gene_name "NAC001"; gene_source "ensembl"; gene_version "1"; p_id "P20332"; transcript_biotype "protein_coding"; transcript_id "AT1G01010.1"; transcript_name "ANAC001"; transcript_source "ensembl"; transcript_version "1"; tss_id "TSS22525";

The BAM file is the output of th TopHat program in Galaxy.

Is it the problem?

Thanks!

ADD REPLY • link written 2.2 years ago by juan.perianez • 20

Did you try the help in the wiki? This is the detailed "how-to-check" linked from the general help shared in my first comment: https://wiki.galaxyproject.org/Support/ChromIdentifiers

Learning how to do this will avoid many headaches in the future. Input mismatch problems are very common and can be avoided. Data with mismatches will never process correctly to produce valid results, whether the tool errors or not.

I am also going into your account to check. Will write back when done with exactly how I checked your particular data as a reply.

ADD REPLY • link modified 2.2 years ago • written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

2.2 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

I checked your history with the analysis. There is a reference genome mismatch problem between the natively indexed TAIR10 genome at Galaxy Main (http://usegalaxy.org) and the GTF reference annotation from iGenomes.

Good news!! You are very close to obtaining a successful run and avoiding issues with other tools. The genome.fa from the same annotation bundle from iGenomes is already in your history! To solve the current problem and later potential issues, do the following:

Re-map the reads again the TAIR10 *genome.fa genome from the bundle*. This will ensure that all data going forward is based on the same exact reference genome with identical chromosome identifiers. This is very important to obtain valid analysis results - whether tools fail or not.
Use the Custom Reference genome option with the mapping tool. General help with video guides and other quick tips: https://galaxyproject.org/learn/custom-genomes/
DO NOT assign the database metadata attribute as the natively indexed TAIR10 genome.
- Instead, promote the Custom Genome to a Custom Build (detailed help in the same link above).
- Assign that Custom Genome Build as the metadata "Database" attribute to the BAM and all other datasets associated with this genome (generated by tools - if not done by default - plus upload datasets used). Again, this avoids issues and ensures tools use the correct genome build. It is worth the extra steps. No one likes to start over from mapping.
- Mapping tools will not use this database assignment, but many other common tools do, and this proper database assignment will avoid further confusing issues/poor results. The goal is to fully annotate datasets with the actual genome used.

Note that some sources of Custom Reference genomes (in particular those from NCBI, or those assembled yourself) have title lines with complicated/extended annotation - not just the simple chromosome identifiers. Before starting an analysis, clean up the title line so that only chromosome identifiers remain (the ">" line in a fasta dataset) and re-wrap the fasta file lines at 80 bases before creating a Custom genome/build. Use the tool NormalizeFasta (also explained in the link above).

Please try the above and let us know if you need more help. Cheers! Jen, Galaxy team

ADD COMMENT • link modified 18 months ago • written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

2.2 years ago by

juan.perianez • 20

juan.perianez • 20 wrote:

Hello,

It is now working! I have the numbers of my first sample!

Thanks!!

Juan.

ADD COMMENT • link written 2.2 years ago by juan.perianez • 20

Great! Thanks for letting us know - Jen

ADD REPLY • link written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

Hello again,

I have one question, I have been looking for the answer but I did not find it.

The question is how many reads can you loose in all the process? what is normal?

For example I have one sample with 49millions of reads when I used tophat I have 43 (87.5% mapped) and later when I use HTseq-Count the final number of reads in all the genes are 23millions.

My samples are not stranded and I am using HTseq-Count with the option for non stranded.

Thanks!

Juan.

ADD REPLY • link written 2.2 years ago by juan.perianez • 20

Reads with multiple alignments, having a mapping quality below the threshold set, that do not map into exons defined in the reference annotation, and a few other factors are not considered in the counts. Htseq-count outputs a tabular file listing the number of reads falling into these categories. A description is included on the tool form, with many more details in the htseq-count manual (it is quite technical but useful to fully understand what exactly the tool is calculating): http://www-huber.embl.de/HTSeq/doc/tour.html#

Using this information it should be possible to optimally tune parameters for your research goals. 50% data loss seems high for such a well annotated genome such as Arabidopsis, but this might be explained by the sequence read content itself (both quality and diversity), the content of the reference annotation, Tophat2 mapping quality, filters and parameters used, and the like. Reviewing the report referenced above will help you to learn the "why" for data loss in the counts and tune parameters.

Ps: Next time for a follow-up question it would be helpful to post a new question post. Prior related posts can be referenced by URL to tie it all together and provide a point of reference and continuity. This also helps other users of this site to locate Q&A topics distinctly.

ADD REPLY • link modified 18 months ago • written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

Please log in to add an answer.

Similar posts • Search »