Question: All reads are 0 in htseq-count
1
gravatar for juan.perianez
19 months ago by
juan.perianez20 wrote:

Hello,

I have a trouble using htseq-count, I used TopHat to aling my data and it looks ok but when I tried to quantify the number of reads all the genes present 0 read.

You know somethign that I could change to obtain the number of reads?

Thanks!!

Juan.

ADD COMMENTlink modified 19 months ago • written 19 months ago by juan.perianez20
1

Do the chromosome names in your GTF and BAM file match? That's the most frequent cause of this.

ADD REPLYlink written 19 months ago by Devon Ryan1.9k

Agree with Devon. Try running BAM-to-SAM on the aligned data outputting just the SAM header and compare the chromosome identifiers between all inputs. More: https://wiki.galaxyproject.org/Support#Reference_genomes

Jen, Galaxy team

ADD REPLYlink written 19 months ago by Jennifer Hillman Jackson24k

Hi,

Thanks for your help, I do not know really how to check this, It is the first time that I use Galaxy. I took the GTF file from the Arabidopsis_thaliana_Ensembl_TAIR10.tar and it present the format that I show below:

Seqname Source Feature Start End Score Strand Frame Attributes 1 ensembl UTR 3631 3759 . + . gene_biotype "protein_coding"; gene_id "AT1G01010"; gene_name "NAC001"; gene_source "ensembl"; gene_version "1"; p_id "P20332"; transcript_biotype "protein_coding"; transcript_id "AT1G01010.1"; transcript_name "ANAC001"; transcript_source "ensembl"; transcript_version "1"; tss_id "TSS22525";

The BAM file is the output of th TopHat program in Galaxy.

Is it the problem?

Thanks!

ADD REPLYlink written 19 months ago by juan.perianez20
1

Did you try the help in the wiki? This is the detailed "how-to-check" linked from the general help shared in my first comment: https://wiki.galaxyproject.org/Support/ChromIdentifiers

Learning how to do this will avoid many headaches in the future. Input mismatch problems are very common and can be avoided. Data with mismatches will never process correctly to produce valid results, whether the tool errors or not.

I am also going into your account to check. Will write back when done with exactly how I checked your particular data as a reply.

ADD REPLYlink modified 19 months ago • written 19 months ago by Jennifer Hillman Jackson24k
2
gravatar for Jennifer Hillman Jackson
19 months ago by
United States
Jennifer Hillman Jackson24k wrote:

Hello,

I checked your history with the analysis. There is a reference genome mismatch problem between the natively indexed TAIR10 genome at Galaxy Main (http://usegalaxy.org) and the GTF reference annotation from iGenomes.

Good news!! You are very close to obtaining a successful run and avoiding issues with other tools. The genome.fa from the same annotation bundle from iGenomes is already in your history! To solve the current problem and later potential issues, do the following:

  • Re-map the reads again the TAIR10 *genome.fa genome from the bundle*. This will ensure that all data going forward is based on the same exact reference genome with identical chromosome identifiers. This is very important to obtain valid analysis results - whether tools fail or not.
  • Use the Custom Reference genome option with the mapping tool. General help with video guides and other quick tips: https://galaxyproject.org/learn/custom-genomes/
  • DO NOT assign the database metadata attribute as the natively indexed TAIR10 genome.
    • Instead, promote the Custom Genome to a Custom Build (detailed help in the same link above).
    • Assign that Custom Genome Build as the metadata "Database" attribute to the BAM and all other datasets associated with this genome (generated by tools - if not done by default - plus upload datasets used). Again, this avoids issues and ensures tools use the correct genome build. It is worth the extra steps. No one likes to start over from mapping.
    • Mapping tools will not use this database assignment, but many other common tools do, and this proper database assignment will avoid further confusing issues/poor results. The goal is to fully annotate datasets with the actual genome used.

Note that some sources of Custom Reference genomes (in particular those from NCBI, or those assembled yourself) have title lines with complicated/extended annotation - not just the simple chromosome identifiers. Before starting an analysis, clean up the title line so that only chromosome identifiers remain (the ">" line in a fasta dataset) and re-wrap the fasta file lines at 80 bases before creating a Custom genome/build. Use the tool NormalizeFasta (also explained in the link above).

Please try the above and let us know if you need more help. Cheers! Jen, Galaxy team

ADD COMMENTlink modified 11 months ago • written 19 months ago by Jennifer Hillman Jackson24k
1
gravatar for juan.perianez
19 months ago by
juan.perianez20 wrote:

Hello,

It is now working! I have the numbers of my first sample!

Thanks!!

Juan.

ADD COMMENTlink written 19 months ago by juan.perianez20

Great! Thanks for letting us know - Jen

ADD REPLYlink written 19 months ago by Jennifer Hillman Jackson24k

Hello again,

I have one question, I have been looking for the answer but I did not find it.

The question is how many reads can you loose in all the process? what is normal?

For example I have one sample with 49millions of reads when I used tophat I have 43 (87.5% mapped) and later when I use HTseq-Count the final number of reads in all the genes are 23millions.

My samples are not stranded and I am using HTseq-Count with the option for non stranded.

Thanks!

Juan.

ADD REPLYlink written 19 months ago by juan.perianez20

Reads with multiple alignments, having a mapping quality below the threshold set, that do not map into exons defined in the reference annotation, and a few other factors are not considered in the counts. Htseq-count outputs a tabular file listing the number of reads falling into these categories. A description is included on the tool form, with many more details in the htseq-count manual (it is quite technical but useful to fully understand what exactly the tool is calculating): http://www-huber.embl.de/HTSeq/doc/tour.html#

Using this information it should be possible to optimally tune parameters for your research goals. 50% data loss seems high for such a well annotated genome such as Arabidopsis, but this might be explained by the sequence read content itself (both quality and diversity), the content of the reference annotation, Tophat2 mapping quality, filters and parameters used, and the like. Reviewing the report referenced above will help you to learn the "why" for data loss in the counts and tune parameters.

Ps: Next time for a follow-up question it would be helpful to post a new question post. Prior related posts can be referenced by URL to tie it all together and provide a point of reference and continuity. This also helps other users of this site to locate Q&A topics distinctly.

ADD REPLYlink modified 11 months ago • written 19 months ago by Jennifer Hillman Jackson24k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 155 users visited in the last hour