What to do when the alignment rate is low even though the genomic data and RNA-seq data are from the same strain

2.6 years ago by

New Delhi, India, ICGEB / JNU / IIT (BHU)

Hello

While doing RNA-seq analysis, I mapped the reads for each condition to the reference genome (the same strain of Geobacillus sp.) with TopHat and got a quite low overall alignment rate (lower than 60% in each condition). For example, in the alignment summary below, I cannot understand why the alignment rate is low even though the genomic data and the RNA-seq data are from the same strain. Can anyone please help me interpret this alignment summary? Is something wrong with the RNA-seq data?

The mapped BAM files are 6G (size on drive) and the unmapped BAM files are less than 100M.

Left reads:

      Input     :  13923415
       Mapped   :   7248369 (52.1% of input)
        of these:   6893771 (95.1%) have multiple alignments (306448 have >20)

Right reads:

      Input     :  13923415
       Mapped   :   7103432 (51.0% of input)
        of these:   6748616 (95.0%) have multiple alignments (306338 have >20)

51.5% overall read mapping rate.

Aligned pairs: 5267947

 of these:   4923439 (93.5%) have multiple alignments
               29026 ( 0.6%) are discordant alignments

37.6% concordant pair alignment rate
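
As a quick sanity check, the percentages in this summary can be recomputed from the raw counts. The sketch below is only illustrative: the variable names are made up, and the concordant-pair formula (aligned pairs minus discordant pairs, divided by input pairs) is an assumption about how the rate is derived, although it does reproduce the reported value. The fraction of mapped reads with multiple alignments stands out in these numbers.

```python
# Counts copied from the alignment summary above.
left_input, left_mapped, left_multi = 13923415, 7248369, 6893771
right_input, right_mapped, right_multi = 13923415, 7103432, 6748616
aligned_pairs, discordant_pairs = 5267947, 29026


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


rates = {
    "left_mapped": pct(left_mapped, left_input),
    "right_mapped": pct(right_mapped, right_input),
    "overall_mapped": pct(left_mapped + right_mapped, left_input + right_input),
    # Share of the mapped reads that align to more than one location.
    "left_multi": pct(left_multi, left_mapped),
    "right_multi": pct(right_multi, right_mapped),
    # Assumed definition: concordant pairs = aligned pairs - discordant pairs.
    "concordant_pairs": pct(aligned_pairs - discordant_pairs, left_input),
}

for name, value in rates.items():
    print(f"{name}: {value}%")
```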

Best Regards

Mayank

assembly tophat alignment bowtie rna-seq • 4.5k views


modified 2.6 years ago by Jennifer Hillman Jackson ♦ 25k • written 2.6 years ago by mayankg.it.bhu • 0

Hi Mayank, were you able to improve your mapping percentage? I am also stuck with the same problem. I have tried trimming, clipping, etc., but no luck yet.

written 24 months ago by computationalvarun • 20

2.6 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Hi Mayank,

Thanks for posting the question to Biostars.

I've seen this occur for a few reasons, some of which can be mitigated:

The input fastq sequence has incorrect quality score scaling. In most cases like this, data that is actually fastqillumina was assigned the fastqsanger datatype (directly or by using the wrong FASTQ Groomer options). Double check your data with this method: https://wiki.galaxyproject.org/Support#FASTQ_Datatype_QA
The input sequence length is not at least twice as long as the setting for "Minimum length of read segments". This option is found on the tool form under TopHat settings to use -> Full parameter list.
Too much trimming or other QA. Often RNA-seq data can be mapped successfully with very little manipulation (as is the case with most expression data of any type, in my opinion - but others' opinions may differ!). If you did QA, consider relaxing the parameters to preserve more of the sequence, or try a test run with very little or no trimming.
Mixed up samples or mixed up forward/reverse reads entered on the tool form. It happens - use the re-run button to double check what was entered (assuming the samples are labeled correctly in Galaxy - going upstream to confirm this might be needed).
Finally, there could be an inherent data problem, in library prep or downstream in sequencing. This is the last thing to check after the informatics above is confirmed to be good. TopHat settings can sometimes be adjusted to help improve overall mapping and concordant pairs. It could be worth reviewing the manual for how the parameters interact and running a few tests to see if the results can be improved.
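
The first two checks in the list above can also be done outside Galaxy with a few lines of Python. This is a rough sketch, not an official Galaxy or TopHat tool: the function names are made up for illustration, the file is assumed to be an uncompressed FASTQ with four lines per read, and the segment length of 25 is assumed to be the value in use (check the tool form for the actual setting). The quality check relies on the ASCII ranges commonly described for Phred+33 and Phred+64 data, so treat the result as a hint only.

```python
def quality_lines(fastq_path, max_reads=10000):
    """Yield the quality line of the first max_reads records.

    Assumes an uncompressed FASTQ file with exactly four lines per read.
    """
    with open(fastq_path) as handle:
        for i, line in enumerate(handle):
            if i >= 4 * max_reads:
                break
            if i % 4 == 3:
                yield line.rstrip("\n")


def guess_phred_offset(quality_strings):
    """Guess the quality offset from the range of ASCII codes seen.

    Returns 33 (fastqsanger-style), 64 (fastqillumina-style), or None
    when the observed range does not settle the question.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:    # codes below ';' do not occur in Phred+64 data
        return 33
    if highest > 74:   # codes above 'J' are not expected in Phred+33 data
        return 64
    return None


def long_enough_for_segments(read_length, segment_length=25):
    """Check that a read is at least twice the read segment length."""
    return read_length >= 2 * segment_length
```

For example, `guess_phred_offset(quality_lines("reads_1.fastq"))` (a hypothetical file name) returning 64 for a dataset labeled fastqsanger would point to the first problem in the list.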

Best, Jen, Galaxy team

written 2.6 years ago by Jennifer Hillman Jackson ♦ 25k

Hey Jennifer,

I have Illumina HiSeq 1000 sequencing data (paired-end RNA-seq, 2 x 100 cycles). I have not performed any trimming on the reads, but the quality control reports show high fluctuation up to 10 bp in the per base sequence content graph (graph1). The sequence length is 100 bp, and there are some abnormalities in the sequence duplication levels (graph2). Are these two going to affect the alignment rate?