htseq-count obtains zero counts

Question: htseq-count obtains zero counts

4.6 years ago by

United States

lrutter • 0 wrote:

I am using the following command:

htseq-count -s no -a 0 FourA.sam hg19.gtf > FourA.count

and getting zero counts in the output. I have tried -s as yes/no (I am almost sure it should be "yes" based off the site I got my files), and -a between 0 and 10. No matter what parameters, I get zero counts.

This is for a quick project, so I am not going for accuracy, I just want any results (of any quality)!

Here is the background of my data
1) Downloaded .sra file.
2) Used fastq-dump to convert .sra file to .fastq file
3) Only kept 1/10 sequences from .fastq file (it was too large).
4) On Galaxy: Ran "Fastq Groomer" tool to remove the leading quality score for the adaptor base (input 'Color Space Sanger', default for the rest)
5) On Galaxy: Ran "Manipulation Reads" to convert to base-space "0123. -> ATCGN" and remove the adaptor itself. This generated fastqcssanger file.
6) On Galaxy: Ran ""Map with BWA for SOLiD" tool on the fastqcssanger file and mrna.fa file

I downloaded the mrna.fa file from: (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/)

Any advice is greatly appreciated. I don't think I have time to rerun/remap. If there is any faster solution for where I went wrong in this pipeline, I would need to do that. I just need something to present, regardless of the quality, and will admit that to my class!!

Below is the head output of some of my files:

hg19.gtf:

chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 66999825 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67098753 67098777 0.000000 + 1 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67098753 67098777 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67101627 67101698 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene exon 67101627 67101698 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
chr1 hg19_refGene CDS 67105460 67105516 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";

mrna.fasta:

>AF001540 1
ggcacgaggcaggtctgtctgttctgttggcaagtaaatgcagtactgtt
ctgatcccgctgctattagaatgcattgtgaaacgactggagtatgatta
aaagttgtgttccccaatgcttggagtagtgattgttgaaggaaaaaatc
cagctgagtgataaggctgagtgttgaggaaatttctgcagttttaagca
gtcgtatttgtgattgaagctgagtacatttgctggtgtatttttaggta
aaatgcttttttgttcatttctgggtggtgggaggggactgaagccttta
gtcttttccagatgcaaccttaaaatcagtgacaagaaacattccaaaca
agcaacagtcttcaagaaattaaactggcaagtggaaatgtttaaacagt
tcagtgatctttagtgcattgtttatgtgtgggtttctctctcccctccc

FourA.sam:

@SQ SN:AF001540 LN:1781
@SQ SN:AF001541 LN:1138
@SQ SN:AF001542 LN:2992
@SQ SN:AF001543 LN:903
@SQ SN:AF001544 LN:434
@SQ SN:AF001545 LN:370
@SQ SN:AF001546 LN:1142
@SQ SN:AF001547 LN:1092
@SQ SN:AF034176 LN:7232
@SQ SN:AF038950 LN:2384

FourA.count (head and tail)

head:
NM_000014 0
NM_000015 0
NM_000016 0
NM_000017 0
NM_000018 0
NM_000019 0
NM_000020 0
NM_000021 0
NM_000022 0
NM_000023 0

tail:
NR_046431 0
NR_046432 0
NR_046433 0
NR_046435 0
NR_046436 0
no_feature 257
ambiguous 0
too_low_aQual 0
not_aligned 6366159
alignment_not_unique 0

count rna-seq deseq2 htseq-count • 6.5k views

ADD COMMENT • link •

modified 4.6 years ago by Jennifer Hillman Jackson ♦ 25k • written 4.6 years ago by lrutter • 0

4.6 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

It appears that your mapping run did not produce any hits. Look further down in the SAM file (past the @SQ lines). You can also generate general counts in Galaxy: run a tool from the SAMTools or Picard group such as 'flagstat' or 'SAM/BAM Alignment Summary Metrics'. From there, you may need to adjust the BWA setting to capture hits. If I remember correctly, the sequence fragments in your mrna.fa custom reference genome were quite short.

Thanks, Jen, Galaxy team

ADD COMMENT • link written 4.6 years ago by Jennifer Hillman Jackson ♦ 25k

Jennifer:

Thank you. Your explanation explains what I have been puzzling over! I am not experienced, and did not know how to check the alignment, I just assumed it was okay.

I ran "flagstat" on one of my BAM files, and it gave me an 11-line output:

6263972 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
406 + 0 mapped (0.01%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I have no idea how to adjust my BWA to capture this, though. Could you help me with how to interpret this?

The seq fragments in my mrna.fa reference genome varied in length. In my .fastq files, they were 50 nt.

But I really need the BAM files by this afternoon - because I have to do additional analyses and present tomorrow :o( Would that even be possible (I could not remember how long the BWA alignment took last time), as it had about 20 hour wait, I think...

Many thanks for your reply!!

ADD REPLY • link written 4.6 years ago by lrutter • 0

Hi, Just stick with this summary, it is enough (Picard can be picky about inputs). It is reporting that the BWA job only succeeded in mapping 406 out of 6263972 total sequences. The hg19.gtf file looks as if it was based off of RefSeq - this is a mismatch for your target database. If this is all true, none of your hits were to a RefSeq accession. This is why your other tool did not produce output, there were no hits to report.

I would consider the mapping a failure. However, you can create a gtf file for the database you mapped against and generate counts for the hits you have. This older post on the UCSC forum has instruction (the file cannot be output from the UCSC Table browser, so cannot be done through the Galaxy "Get Data -> UCSC Main" tool).
http://redmine.soe.ucsc.edu/forum/index.php?t=msg&goto=14375&S=7744d556a73fd27a61976e12db2bc65c

Part of what you talk about could be what went right, what went wrong, and what you learned about making sure that all data inputs are a match. Learning is part of the process.

Next time, in order to do this correctly (gain more hits, etc.) you might consider mapping to a different database and mapping a few times to find the best parameters to fit your data. Was it a requirement to use the mrna division of Genbank? There is a great deal of duplicated and fragmented sequence in there. Using RefSeq (full length transcripts, non-redundant) with the matching .gtf file (what it appears you used before, but double check) would be a good choice.

Also, using a tool such as Megablast could be an option, it depends what the downsteam goal is. If just counting hits to known transcripts, then this tool would be fine. Megablast would require the use of a cloud Galaxy. Amazon has an educational grant program that can cover Amazon usage costs (the Galaxy portion is free everywhere).

Hopefully there is an option in here that will help!

Good luck! Jen, Galaxy team

ADD REPLY • link written 4.6 years ago by Jennifer Hillman Jackson ♦ 25k

Wow, it is like music to my ears to hear that I might not have to remap... because I am rushing, unfortunately this time. It sounds like the first option you provided may be faster to get the count tables? (Next time, I would like to better understand what I am doing, and will read over your many helpful suggestions from this first time :o)!)

Just to make sure I understand correctly, you are simply suggesting that I keep my current SAM/BAM files, and still use HTSeq-count command as follows:

htseq-count -s no -a 0 FourA.sam hg19.gtf > FourA.count

except that instead of using hg19.gtf in the command above, I would use mrna.gtf. And I would generate this mrna.gtf file from the mrna.fa file via the instructions you provided on that UCSC genome support website, i.e. from the command:

zcat all_mrna.txt.gz | cut -f2- | pslToBed stdin stdout | bedToGenePred
stdin stdout | genePredToGtf file stdin all_mrna.gtf)

It just seems strange to me that the input is called all_mrna.txt.gz. Would this simply be my mrna.fa file?

Also, in case that particular command does not work for me (or I have trouble installing the necessary packages - I am terrible with installation!!), what is this process called, in general? I might have to search online for other ways to do this, should this particular command not work :o)

Thanks so much again for your help!!

ADD REPLY • link written 4.6 years ago by lrutter • 0

Hi again,

Yes, I gave you the quick (and dirty) way to do something with the data you have. I would never recommend this for anything except for what you are stating the purpose for!

Follow the UCSC instructions completely. The starting file is from the database behind the UCSC browser - don't worry about the file extension - just get the correct file named this same exact way from downloads area. It is most definitely not the same thing as the fasta file. It is a file is psl format. This is just another tabular datatype with chrom, start, end, plus other goodies. Contains all the essential data to convert to bed, genePred, and GTF (as the transformation workflow goes). Later when you have time, go to the UCSC help pages and look up data types, all are defined.

To do the transformation, you also need to obtain their source utilities. Are pre-complied for various platforms. Ready to use.

Do either:

Go here: http://genome.ucsc.edu/ and follow blue left bar link to Downloads and scroll down to "Source", which is here: http://hgdownload.soe.ucsc.edu/downloads.html#source_downloads
Go directly to here (same place once you click through everything) and pick your OS: http://hgdownload.soe.ucsc.edu/admin/exe/

Use a command like this to get the entire directory once you know which one you want:

wget 'ftp://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/*'

Follow the command-line given for the transformation, now. Later, how to use each is very intuitive, type the command for usage (in all cases, I believe, I haven't worked there in over 3 years, so something may have changed, but these guys are very good - and the tools are core utilities!).

I happen to be looking at Hiram, their lead developer, literally right now across a parking lot (!) and his infectious calmness is making me calm and I am extending that to you.

There will be no other way to transform the data that I know of. And this is easier than it may seem at first. Give it a try and see.

ADD REPLY • link modified 4.6 years ago • written 4.6 years ago by Jennifer Hillman Jackson ♦ 25k

Jennifer:

Thanks for your help, and thanks for sending calmness my way via Hiram! :o)

Your suggestion did not work, but I think I found the problem. I was working with RNA-seq, and one of my classmates said BWA is better for DNA. When I tried using bowtie with -c option (for colorspace), there were much more mappings, so I went with that.

Thanks for all your help this past week. I really appreciate it!!

ADD REPLY • link written 4.6 years ago by lrutter • 0

Below is the output "log" of the SAM/BAM Alignment Summary Metrics on a BAM file (I bolded a line that does not look too good):

INFO:root:## executing java -Xmx4g -Djava.io.tmpdir='/galaxy/main/scratch' -jar /galaxy/main/deps/picard/1.56.0/devteam/picard/e0232cbac965/jars/CreateSequenceDictionary.jar REFERENCE=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.fasta OUTPUT=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.dict URI=dataset_8089824.dat TRUNCATE_NAMES_AT_WHITESPACE=None returned status 1 and stderr:
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/tmp
[Mon May 05 10:49:41 CDT 2014] net.sf.picard.sam.CreateSequenceDictionary REFERENCE=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.fasta OUTPUT=/galaxy/main/scratch/CollectAlignmentSummaryMetricsNnX2Ds.dict URI=dataset_8089824.dat TRUNCATE_NAMES_AT_WHITESPACE=false NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Mon May 05 10:49:41 CDT 2014] Executing as g2main@roundup51.tacc.utexas.edu on Linux 2.6.32-358.23.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_40-b43
[Mon May 05 10:49:41 CDT 2014] net.sf.picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2025324544
Exception in thread "main" net.sf.samtools.SAMException: Sequence name contains invalid character: AF001540 1
   at net.sf.samtools.SAMSequenceRecord.<init>(SAMSequenceRecord.java:80)
   at net.sf.picard.sam.CreateSequenceDictionary.makeSequenceRecord(CreateSequenceDictionary.java:147)
   at net.sf.picard.sam.CreateSequenceDictionary.makeSequenceDictionary(CreateSequenceDictionary.java:138)
   at net.sf.picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:113)
   at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
   at net.sf.picard.sam.CreateSequenceDictionary.main(CreateSequenceDictionary.java:93)

INFO:root:## executing samtools sort /galaxy-repl/main/files/008/119/dataset_8119150.dat dataset_8119150.dat.sorted returned status 0 and nothing on stderr

INFO:root:## executing java -Xmx4g -Djava.io.tmpdir='/galaxy/main/scratch' -jar /galaxy/main/deps/picard/1.56.0/devteam/picard/e0232cbac965/jars/CollectAlignmentSummaryMetrics.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true ADAPTER_SEQUENCE= IS_BISULFITE_SEQUENCED=false MAX_INSERT_SIZE=100000 OUTPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetrics.metrics.txt R=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetricsNnX2Ds.fasta_fake.fasta TMP_DIR=/galaxy/main/scratch INPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/dataset_8119150.dat.sorted.bam returned status 0 and nothing on stderr

And the output "txt" of the same program:

## net.sf.picard.metrics.StringHeader
# net.sf.picard.analysis.CollectAlignmentSummaryMetrics MAX_INSERT_SIZE=100000 ADAPTER_SEQUENCE=[AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, IS_BISULFITE_SEQUENCED=false] INPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/dataset_8119150.dat.sorted.bam OUTPUT=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetrics.metrics.txt REFERENCE_SEQUENCE=/galaxy/main/jobdir/006/922/6922621/dataset_8126499_files/CollectAlignmentSummaryMetricsNnX2Ds.fasta_fake.fasta ASSUME_SORTED=true TMP_DIR=[/galaxy/main/scratch] VALIDATION_STRINGENCY=LENIENT METRIC_ACCUMULATION_LEVEL=[ALL_READS] IS_BISULFITE_SEQUENCED=false STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## net.sf.picard.metrics.StringHeader
# Started on: Mon May 05 10:50:52 CDT 2014

## METRICS CLASS   net.sf.picard.analysis.AlignmentSummaryMetrics
CATEGORY   TOTAL_READS   PF_READS   PCT_PF_READS   PF_NOISE_READS   PF_READS_ALIGNED   PCT_PF_READS_ALIGNED   PF_ALIGNED_BASES   PF_HQ_ALIGNED_READS   PF_HQ_ALIGNED_BASES   PF_HQ_ALIGNED_Q20_BASES   PF_HQ_MEDIAN_MISMATCHES   PF_MISMATCH_RATE   PF_HQ_ERROR_RATE   PF_INDEL_RATE   MEAN_READ_LENGTH   READS_ALIGNED_IN_PAIRS   PCT_READS_ALIGNED_IN_PAIRS   BAD_CYCLES   STRAND_BALANCE   PCT_CHIMERAS   PCT_ADAPTER   SAMPLE   LIBRARY   READ_GROUP
UNPAIRED   6263972   6263972   1   0   406   0.000065   19883   63   3078   2442   0   0.00176   0.004873   0.001257   49.999925   0   0   0   0.578818   0   0

ADD REPLY • link written 4.6 years ago by lrutter • 0

When I do a head on the SAM file used to generate the BAM file used in the "SAM/BAM Alignment Summary Metrics", I get the following (where the first line contains the characters that caused the bolded exception in my comment right above):

@SQ   SN:AF001540   LN:1781
@SQ   SN:AF001541   LN:1138
@SQ   SN:AF001542   LN:2992
@SQ   SN:AF001543   LN:903
@SQ   SN:AF001544   LN:434
@SQ   SN:AF001545   LN:370
@SQ   SN:AF001546   LN:1142
@SQ   SN:AF001547   LN:1092
@SQ   SN:AF034176   LN:7232
@SQ   SN:AF038950   LN:2384

ADD REPLY • link written 4.6 years ago by lrutter • 0

Similar posts • Search »