Question: How can I improve my mapping alignment rate with MiSeq 2x300 bp paired-end reads?
12 weeks ago by
collinmlanger20 wrote:

Edit 1: It appears I misinterpreted the raw data, which in turn led to my poor results. The library construction is explained here: http://presto.readthedocs.io/en/latest/workflows/VanderHeiden2017_Workflow.html. As I understand it, a full transcriptome analysis is therefore not possible with this data set. Is that correct?

Hello everyone,

I am new to RNA-seq analysis and am having trouble getting my alignment rate above ~36% with STAR (TopHat and HISAT2 produced similar results). The majority of my unmapped reads are listed as "too short". I trim off the adapter sequences and remove low-quality reads before mapping, but the problem persists. I'm using data from BioProject accession PRJNA338795 (2x300 bp paired-end sequencing with a 20% PhiX spike on the Illumina MiSeq platform according to the manufacturer's recommendations) and see similar results with all samples in the study.

Some help would definitely be appreciated. Am I missing something? Links to my workflow and to the full article for the study the data come from are listed below:

Workflow: https://usegalaxy.org/u/clanger/w/collin-rna-seq-for-two-groups

Article: http://www.jimmunol.org/content/jimmunol/198/4/1460.full.pdf

STAR log from SRR4026011 post trim:

Mapping speed, Million of reads per hour | 5.79

                      Number of input reads |   537330
                  Average input read length |   505
                                UNIQUE READS:
               Uniquely mapped reads number |   198253
                    Uniquely mapped reads % |   36.90%
                      Average mapped length |   426.95
                   Number of splices: Total |   264649
        Number of splices: Annotated (sjdb) |   180334
                   Number of splices: GT/AG |   224025
                   Number of splices: GC/AG |   1730
                   Number of splices: AT/AC |   67
           Number of splices: Non-canonical |   38827
                  Mismatch rate per base, % |   1.47%
                     Deletion rate per base |   0.02%
                    Deletion average length |   3.06
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.68
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   102546
         % of reads mapped to multiple loci |   19.08%
    Number of reads mapped to too many loci |   264
         % of reads mapped to too many loci |   0.05%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   40.99%
                 % of reads unmapped: other |   2.98%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

STAR log from SRR4026011 pre trim:

Mapping speed, Million of reads per hour | 5.15

                      Number of input reads |   810841
                  Average input read length |   602
                                UNIQUE READS:
               Uniquely mapped reads number |   151954
                    Uniquely mapped reads % |   18.74%
                      Average mapped length |   496.60
                   Number of splices: Total |   185516
        Number of splices: Annotated (sjdb) |   136591
                   Number of splices: GT/AG |   163633
                   Number of splices: GC/AG |   952
                   Number of splices: AT/AC |   71
           Number of splices: Non-canonical |   20860
                  Mismatch rate per base, % |   1.61%
                     Deletion rate per base |   0.02%
                    Deletion average length |   3.32
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.44
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   52464
         % of reads mapped to multiple loci |   6.47%
    Number of reads mapped to too many loci |   19
         % of reads mapped to too many loci |   0.00%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   70.19%
                 % of reads unmapped: other |   4.60%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%
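
As a quick sanity check on logs like the two above, the read categories can be tallied with a few lines of Python. This is only a minimal sketch: the function names (`parse_star_log`, `category_percentages`) are made up for illustration, and the embedded text is an excerpt of the post-trim log quoted above rather than a full `Log.final.out` file.

```python
# Excerpt of the post-trim STAR log quoted above (SRR4026011).
STAR_LOG = """\
                      Number of input reads |   537330
               Uniquely mapped reads number |   198253
                    Uniquely mapped reads % |   36.90%
         % of reads mapped to multiple loci |   19.08%
         % of reads mapped to too many loci |   0.05%
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   40.99%
                 % of reads unmapped: other |   2.98%
"""

# The six mutually exclusive read categories reported in the log.
CATEGORY_KEYS = [
    "Uniquely mapped reads %",
    "% of reads mapped to multiple loci",
    "% of reads mapped to too many loci",
    "% of reads unmapped: too many mismatches",
    "% of reads unmapped: too short",
    "% of reads unmapped: other",
]


def parse_star_log(text):
    """Return {field name: raw string value} for every 'name | value' line."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def category_percentages(stats):
    """Convert the percentage fields listed in CATEGORY_KEYS to floats."""
    return {key: float(stats[key].rstrip("%")) for key in CATEGORY_KEYS}


stats = parse_star_log(STAR_LOG)
percentages = category_percentages(stats)
for key, pct in percentages.items():
    print(f"{pct:6.2f}%  {key}")
print(f"{sum(percentages.values()):6.2f}%  total")

# Cross-check the reported unique-mapping rate against the raw counts.
unique_fraction = (
    int(stats["Uniquely mapped reads number"]) / int(stats["Number of input reads"])
)
print(f"unique / input = {100 * unique_fraction:.2f}%")
```

Laid out this way, it is easy to see that "too short" is the single largest category in both runs, so that is the number worth chasing.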
rna-seq alignment • 177 views
ADD COMMENTlink modified 12 weeks ago by Guy Reeves1.0k • written 12 weeks ago by collinmlanger20

Hi, we have often noticed that 2x300 bp reads give a lower mapping rate than anticipated. It is my understanding that mapping programs essentially apply an absolute cutoff for mapping rather than a percentage of read length (because of the way k-mers are used). So the default cutoff might be 2 bp regardless of whether you are using 2x300 bp or 2x75 bp reads. This is worth considering, particularly because base quality often drops off dramatically after 150 bp, so sequencing errors may limit mapping. You are using 'Maximum mismatch count which will still allow a full match to be performed = 2', which is probably too low; try increasing it and see what happens. I do not use Trimmomatic, so I am cautious about giving advice.
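
For STAR specifically, the "too short" category is tied to the `--outFilterMatchNminOverLread` and `--outFilterScoreMinOverLread` options, which both default to 0.66 and are expressed as a fraction of the read length (the sum of both mates for paired-end data). The rough arithmetic below applies that fraction to the average input lengths from the logs in the question. The helper name is hypothetical, and averages only give a ballpark, since the filter is applied per read pair.

```python
def star_min_aligned_bases(read_length, min_over_lread=0.66):
    """Matched bases needed under STAR's default --outFilterMatchNminOverLread.

    read_length is the combined length of both mates for paired-end reads.
    """
    return min_over_lread * read_length


# Average input read lengths taken from the two STAR logs in the question.
for label, avg_len in [("post trim", 505), ("pre trim", 602)]:
    needed = star_min_aligned_bases(avg_len)
    print(f"{label}: average pair length {avg_len} bp "
          f"-> about {needed:.0f} matched bases needed")
```

If one mate (or the overlapping part of both) fails to align, a 2x300 bp pair can fall under this threshold even when the other mate aligns well, which would be reported as "too short".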

The other thing to check is the insert size distribution (tool: CollectInsertSizeMetrics). If your insert size is much less than 2x300 bp, you are essentially sequencing the same bases twice. That does not explain the low mapping rate, but it will tell you whether the expensive 2x300 bp run is cost-efficient. This is not an answer to your question, but it might be something to think about.
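
To put numbers on the "sequencing the same thing twice" point, here is a small sketch of the overlap arithmetic. It assumes "insert size" means the length of the fragment between the adapters and that both mates are sequenced to full length; `mate_overlap` is a made-up helper, not part of any tool.

```python
def mate_overlap(insert_size, read_length=300):
    """Number of fragment bases covered by both mates of a pair.

    Each mate covers at most min(read_length, insert_size) bases of the
    fragment; anything read past the fragment end is adapter sequence.
    """
    covered_per_mate = min(read_length, insert_size)
    return max(0, 2 * covered_per_mate - insert_size)


for insert_size in (200, 400, 600, 700):
    overlap = mate_overlap(insert_size)
    print(f"insert {insert_size} bp: {overlap} bp read twice "
          f"({100 * overlap / insert_size:.0f}% of the fragment)")
```

With a 400 bp insert, for example, half of the fragment is covered by both mates, and with inserts shorter than 300 bp each mate also runs into the adapter.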

cheers

Guy

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by Guy Reeves1.0k
12 weeks ago by
Guy Reeves1.0k
Germany
Guy Reeves1.0k wrote:

If your data are as described in Edit 1, limited to V(D)J loci, you are correct: you cannot do a 'full transcriptome analysis', as there would be no data about other genes.
Cheers

Guy

ADD COMMENTlink written 12 weeks ago by Guy Reeves1.0k

though I may not have fully understood

ADD REPLYlink written 12 weeks ago by Guy Reeves1.0k