Hi,
I imported some FASTQ files (Homo sapiens) from http://www.ebi.ac.uk/ena/data/ into Galaxy. To obtain read counts, I aligned them to hg19 using HISAT (default parameters). Since my reference genome was hg19, I then used the GTF file from GENCODE (Version 19 (July 2013 freeze, GRCh37) - Ensembl 74, 75) to obtain read counts with HTSeq.
The total number of counts assigned to features is 10347508, which seems OK. However, a substantial number of reads were not assigned to any feature:
__no_feature 2362227
__ambiguous 788874
__too_low_aQual 1001993
__not_aligned 2517255
__alignment_not_unique 3866370
Do you think the result is reasonable?
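For context, here is a rough sketch of how these numbers break down as fractions of the total. It assumes the six categories are mutually exclusive and that each read is counted exactly once, which may not hold for multi-mapped reads; the dictionary key `assigned_to_features` and the helper `fractions` are just names I made up for illustration.

```python
# Counts copied from the htseq-count summary in this post.
# "assigned_to_features" is a made-up label for the reads counted for genes.
counts = {
    "assigned_to_features": 10347508,
    "__no_feature": 2362227,
    "__ambiguous": 788874,
    "__too_low_aQual": 1001993,
    "__not_aligned": 2517255,
    "__alignment_not_unique": 3866370,
}


def fractions(counts):
    """Return each category as a fraction of the sum of all categories."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


# Assumption: the categories do not overlap, so their sum is the total.
total = sum(counts.values())

for name, frac in fractions(counts).items():
    print(f"{name:<26}{counts[name]:>12,d}{frac:>8.1%}")
```

Under that assumption, a little under half of the total ends up assigned to features, and the largest loss is `__alignment_not_unique`.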
Something confusing is that, out of 57820 genes in total, the counts for genes up to gene 18356 are mostly non-zero, whereas the counts for genes 18356 to 57820 are mostly zero (only a few of them are non-zero).
Why is that?
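To put a number on "mostly zero", something like the sketch below could tally non-zero genes before and after a given row. It assumes the htseq-count output is a two-column, tab-separated file (gene ID, count) in which the special counters start with `__`; the function name `summarize_htseq` and the demo data are hypothetical, not my real output.

```python
import csv
import io


def summarize_htseq(handle, split_at):
    """Tally (nonzero, total) genes for rows 1..split_at and for the rest.

    Assumes tab-separated lines of "gene_id<TAB>count"; lines whose ID
    starts with "__" (e.g. __no_feature) are special counters and skipped.
    """
    first = [0, 0]  # [nonzero genes, total genes] up to split_at
    rest = [0, 0]   # same tallies for genes after split_at
    gene_index = 0
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("__"):
            continue
        gene_index += 1
        bucket = first if gene_index <= split_at else rest
        bucket[1] += 1
        if int(row[1]) > 0:
            bucket[0] += 1
    return tuple(first), tuple(rest)


# Made-up demo data standing in for a real htseq-count output file.
demo = io.StringIO(
    "geneA\t12\ngeneB\t4\ngeneC\t0\ngeneD\t0\ngeneE\t1\n__no_feature\t7\n"
)
first, rest = summarize_htseq(demo, split_at=2)
print("first block (nonzero, total):", first)
print("remaining   (nonzero, total):", rest)
```

For a real file, the same function could be called with `open(path, newline="")` and `split_at=18356`.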
Do you think I have to change my GTF file? If so, which version should I use?
Or should I consider only the first 18356 genes for DE analysis?
Thanks