Edit 1: It appears I misinterpreted the raw data, which in turn led to my poor results. The library construction is explained here: http://presto.readthedocs.io/en/latest/workflows/VanderHeiden2017_Workflow.html So it would seem, to my understanding, that a full transcriptome analysis is not possible with this data set?
Hello everyone,
I am new to RNA-seq analysis and am having trouble getting my alignment rate above ~36% with STAR (TopHat and HISAT2 produced similar results); the majority of my unmapped reads are listed as "too short". I trim off the adapter sequences and remove low-quality reads before mapping, but I still have the same issue. I'm using data from BioProject accession PRJNA338795 (2x300 bp paired-end sequencing with a 20% PhiX spike on the Illumina MiSeq platform, according to the manufacturer’s recommendations), and I get similar results with all samples in the study.
Some help would definitely be appreciated. Am I missing something? Links to my workflow and to the full article for the study the data come from are listed below:
Workflow: https://usegalaxy.org/u/clanger/w/collin-rna-seq-for-two-groups
Article: http://www.jimmunol.org/content/jimmunol/198/4/1460.full.pdf
STAR log from SRR4026011 post trim:
Mapping speed, Million of reads per hour | 5.79
Number of input reads | 537330
Average input read length | 505
UNIQUE READS:
Uniquely mapped reads number | 198253
Uniquely mapped reads % | 36.90%
Average mapped length | 426.95
Number of splices: Total | 264649
Number of splices: Annotated (sjdb) | 180334
Number of splices: GT/AG | 224025
Number of splices: GC/AG | 1730
Number of splices: AT/AC | 67
Number of splices: Non-canonical | 38827
Mismatch rate per base, % | 1.47%
Deletion rate per base | 0.02%
Deletion average length | 3.06
Insertion rate per base | 0.01%
Insertion average length | 1.68
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 102546
% of reads mapped to multiple loci | 19.08%
Number of reads mapped to too many loci | 264
% of reads mapped to too many loci | 0.05%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 40.99%
% of reads unmapped: other | 2.98%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
STAR log from SRR4026011 pre trim:
Mapping speed, Million of reads per hour | 5.15
Number of input reads | 810841
Average input read length | 602
UNIQUE READS:
Uniquely mapped reads number | 151954
Uniquely mapped reads % | 18.74%
Average mapped length | 496.60
Number of splices: Total | 185516
Number of splices: Annotated (sjdb) | 136591
Number of splices: GT/AG | 163633
Number of splices: GC/AG | 952
Number of splices: AT/AC | 71
Number of splices: Non-canonical | 20860
Mismatch rate per base, % | 1.61%
Deletion rate per base | 0.02%
Deletion average length | 3.32
Insertion rate per base | 0.01%
Insertion average length | 1.44
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 52464
% of reads mapped to multiple loci | 6.47%
Number of reads mapped to too many loci | 19
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 70.19%
% of reads unmapped: other | 4.60%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
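To compare the two logs side by side, the "key | value" lines can be parsed and the percentages recomputed from the raw counts. This is only a minimal sketch: the function names `parse_star_log` and `mapping_summary` are made up for illustration, and the embedded text is just a few lines copied from the post-trim log above.

```python
def parse_star_log(text):
    """Parse 'key | value' lines from a STAR final log into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip section headers such as 'UNIQUE READS:'
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def mapping_summary(stats):
    """Recompute mapping percentages from the raw read counts."""
    total = int(stats["Number of input reads"])
    unique = int(stats["Uniquely mapped reads number"])
    multi = int(stats["Number of reads mapped to multiple loci"])
    return {
        "unique_pct": round(100 * unique / total, 2),
        "multi_pct": round(100 * multi / total, 2),
        "mapped_pct": round(100 * (unique + multi) / total, 2),
    }


# A few lines copied from the post-trim log for SRR4026011.
POST_TRIM = """\
Number of input reads | 537330
UNIQUE READS:
Uniquely mapped reads number | 198253
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 102546
"""

summary = mapping_summary(parse_star_log(POST_TRIM))
print(summary)
```

Counting multi-mapped reads as mapped, roughly 56% of the trimmed reads align, so the "too short" category accounts for most of what is left.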
Hi, we have often noticed that 2x300 bp reads give lower than anticipated mapping rates. It is my understanding that all mapping programs essentially have an absolute cutoff for mapping rather than a percentage of the read length (because of the way k-mers are used). So the default cutoff might be 2 bp regardless of whether you are using 2x300 bp or 2x75 bp reads. This might be something to think about, particularly since base quality often drops off dramatically after 150 bp, so errors may limit mapping. You are using 'Maximum mismatch count which will still allow a full match to be performed = 2', which is probably too low; try increasing this and see what happens. I do not use Trimmomatic, so I am cautious about giving advice.
The other thing would be to know the insert size distribution (tool: CollectInsertSizeMetrics). If your insert size is much less than 2x300 bp, then you are essentially sequencing the same thing twice. It does not explain the low mapping rate, but it will give you an idea of whether the expensive 2x300 bp run is cost-efficient. Not an answer to your question, but it might be something to think about.
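To put rough numbers on the fixed-cutoff point, here is a small back-of-the-envelope sketch. It assumes a hard cap of 2 mismatches per read (the value quoted above, whichever tool it applies to) and spreads the 1.47% per-base mismatch rate from the post-trim log evenly along the read. That uniformity is an assumption: the rate is measured on mapped reads only, and errors tend to cluster towards the 3' end. The function names are hypothetical.

```python
def tolerated_error_rate(max_mismatches, read_length):
    """Per-base error rate (%) that a fixed mismatch cap can absorb."""
    return 100.0 * max_mismatches / read_length


def expected_mismatches(rate_pct, read_length):
    """Expected mismatches per read, assuming a uniform per-base rate."""
    return rate_pct / 100.0 * read_length


MAX_MISMATCHES = 2        # fixed cap taken from the setting quoted above
OBSERVED_RATE_PCT = 1.47  # 'Mismatch rate per base' in the post-trim log

for length in (75, 300):
    tolerated = tolerated_error_rate(MAX_MISMATCHES, length)
    expected = expected_mismatches(OBSERVED_RATE_PCT, length)
    print(f"{length} bp: cap tolerates {tolerated:.2f}% errors, "
          f"expect about {expected:.2f} mismatches per read")
```

Under these assumptions a 75 bp read stays within a cap of 2 on average, while a 300 bp read is expected to exceed it.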
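As a rough illustration of the "sequencing the same thing twice" point, the overlap between the two mates can be worked out from the read length and the insert size. This is a sketch only: `mate_overlap` is a made-up helper, and "insert size" here is assumed to mean the fragment length without adapters.

```python
def mate_overlap(read_length, insert_size):
    """Number of bases covered by both mates of a read pair.

    Assumes insert_size is the fragment length excluding adapters.
    """
    covered_by_both = 2 * read_length - insert_size
    # A mate cannot overlap by more than the insert or more than one read.
    return max(0, min(covered_by_both, insert_size, read_length))


READ_LENGTH = 300  # 2x300 bp run, as in the question

for insert in (200, 400, 700):
    overlap = mate_overlap(READ_LENGTH, insert)
    redundant = overlap / (2 * READ_LENGTH)
    print(f"insert {insert} bp: {overlap} bp overlap, "
          f"{redundant:.0%} of sequenced bases redundant")
```

The real insert sizes would have to come from the CollectInsertSizeMetrics output; the values 200, 400 and 700 are placeholders.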
cheers
Guy