Question: Filtering BAM files from HISAT2
dexter.myrick wrote, 8 weeks ago:

Hi. I am new to RNA-seq and I have a couple of quick questions. My input was paired-end, non-stranded FASTQ files. Below is an example of the summary stats from one of my samples. Are these results acceptable and within the expected normal range?

I was also planning on filtering the BAM files by MAPQ score to get rid of low-quality reads. I read that trimming before alignment can affect alignment and introduce bias into the downstream estimation of read counts, so I did not trim before running HISAT2. Is this an appropriate thing to do?

Should I also filter out unpaired reads and reads that are not uniquely mapped? It was my understanding that most unpaired reads are produced by quality-score trimming, but I did not do any trimming. Is this an expected amount of unpaired reads? Also, I understand how to filter out unpaired reads using SAMtools, but I don't know how to filter out multi-mapped reads, or even if I should. I can't seem to find anywhere what the pros and cons are of keeping or getting rid of unpaired reads and/or multi-mapped reads. Thanks!!

HISAT2 summary stats:
    Total pairs: 32249562
        Aligned concordantly or discordantly 0 time: 3651029 (11.32%)
        Aligned concordantly 1 time: 26084216 (80.88%)
        Aligned concordantly >1 times: 1290968 (4.00%)
        Aligned discordantly 1 time: 1223349 (3.79%)
    Total unpaired reads: 7302058
        Aligned 0 time: 3915231 (53.62%)
        Aligned 1 time: 2996094 (41.03%)
        Aligned >1 times: 390733 (5.35%)
    Overall alignment rate: 93.93%
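
For anyone checking how these numbers fit together, here is a minimal Python sketch that recomputes the overall alignment rate from the counts above. It assumes the rate is counted per read (each pair contributes two reads) and that the "unpaired reads" are the individual mates of pairs that aligned 0 times as a pair; the variable names are made up for illustration.

```python
# Counts copied from the HISAT2 summary above.
total_pairs = 32249562
pairs_unaligned = 3651029          # aligned concordantly or discordantly 0 times
pairs_concordant_once = 26084216
pairs_concordant_multi = 1290968
pairs_discordant_once = 1223349
unpaired_unaligned = 3915231
unpaired_once = 2996094
unpaired_multi = 390733

# Assumption: every pair that failed to align as a pair is split into
# two mates, which are then aligned individually as "unpaired reads".
total_unpaired = 2 * pairs_unaligned

# Reads that aligned, either as part of a pair or individually.
aligned_reads = (
    2 * (pairs_concordant_once + pairs_concordant_multi + pairs_discordant_once)
    + unpaired_once
    + unpaired_multi
)
total_reads = 2 * total_pairs
overall_rate = 100 * aligned_reads / total_reads

print(total_unpaired)          # matches "Total unpaired reads" in the summary
print(f"{overall_rate:.2f}%")  # matches "Overall alignment rate" in the summary
```

Under that reading, the unpaired reads in this summary come from pairs that could not be placed together, not from trimming.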
Jennifer Hillman Jackson (United States) wrote, 8 weeks ago:

Hello,

The alignment rates are very good, including the low number of unpaired reads. The metrics that (usually) matter most are these:

Aligned concordantly 1 time: 26084216 (80.88%)

Overall alignment rate: 93.93%

More details about evaluating NGS reads in general, and about the types of filtering/QA to do for specific analysis workflows, are covered in the Galaxy tutorials here, along with links to external resources (publications, discussions):

Hope that helps! Jen, Galaxy team


Thanks for the help! That answers the first part of my question. However, I have looked at those tutorials and nothing there addresses the second part of my question. Is it appropriate to skip trimming and filter BAM files based on MAPQ scores after alignment? What are the pros and cons of filtering out unpaired reads and multi-mapped reads? And how do I filter out multi-mapped reads in SAMtools?

written 8 weeks ago by dexter.myrick

Yes, those portions are more of a judgment call, and no single answer is definitive across all analysis workflows. Still, here is a bit more info:

  1. Trimming versus post-alignment filtering: Trimming can help sequences align but is not always necessary (sequences that are all or mostly artifact would fall out during alignment anyway). One way to check: run FastQC on the original data, run Trimmomatic followed by FastQC on the same data, compare the two FastQC reports, then map both versions and compare the alignment rates/quality.

  2. Filter by MAPQ: Yes, do this, especially if calling variants. A how-to in the context of an example analysis is covered in the variant analysis tutorials.

  3. Unpaired reads: Some tools consider these during execution and some do not. Others require that the inputs are strictly paired to start with. For tools that do use unpaired reads, these orphan reads can produce spurious results. Sometimes that is OK, for example when data mining in a specific region where all available evidence is wanted so that a human can review it and make decisions about the result. Again, you could try both with whatever tools you are using and compare the differences.

  4. Multi-mapped reads: Alignment tools retain multiple hits because each is considered just as "good" as the others (if only primary alignments are reported; more below). Try filtering for properly paired mapped reads (and on other features, if desired) with the tool NGS: SAMtools >> Filter SAM or BAM, output SAM or BAM files on FLAG MAPQ RG LN or by region.
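
To make the "properly paired plus MAPQ" filter concrete, here is a rough Python sketch that applies it to SAM text lines. It is only an illustration of the logic, not what the Galaxy tool runs internally: the function names, the example records, and the MAPQ cutoff of 20 are all assumptions. On the command line, `samtools view -b -f 2 -q 20 in.bam` expresses the same idea (`-f` requires flag bits, `-q` sets a minimum MAPQ).

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned per the aligner


def keep_record(sam_line, min_mapq=20):
    """Return True if an alignment line is properly paired with MAPQ >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: MAPQ
    return bool(flag & PROPER_PAIR) and mapq >= min_mapq


def filter_sam(lines, min_mapq=20):
    """Yield header lines unchanged, plus alignment lines that pass keep_record."""
    for line in lines:
        if line.startswith("@") or keep_record(line, min_mapq):
            yield line


# Made-up records for illustration only.
example = [
    "@HD\tVN:1.6\n",
    "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII\n",  # proper pair, high MAPQ
    "r2\t99\tchr1\t100\t1\t4M\t=\t300\t204\tACGT\tIIII\n",   # proper pair, low MAPQ
    "r3\t73\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",      # mate unmapped
]
kept = list(filter_sam(example))
print([line.split("\t")[0] for line in kept])
```

The cutoff is the judgment call here; it is worth checking which MAPQ values your aligner actually emits before picking one.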

If an unpaired read is multi-mapping, that is probably a spurious result, i.e. a non-specific hit with only one evidence point (paired data has two to start with, and a third if properly paired). Properly paired reads that have more than one hit are mapping to a duplicated (or near-duplicated) genome region. The Filter SAM or BAM tool can filter on the bitwise flags, for example to pick only the primary alignment(s). You can mark duplicates with the NGS: Picard >> Mark Duplicates tool (covered in the variant analysis tutorials).
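
Since the bitwise flags come up a lot, here is a small Python sketch that decodes a SAM FLAG value and tests whether a record is a mapped primary alignment. The bit values are the ones defined in the SAM specification; the short labels and function names are my own, purely for illustration.

```python
# FLAG bits as defined in the SAM specification; the labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the labels of every bit that is set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def is_mapped_primary(flag):
    """True if the record is mapped and neither secondary nor supplementary."""
    return not (flag & 0x4) and not (flag & (0x100 | 0x800))


print(decode_flag(99))
print(decode_flag(355))
print(is_mapped_primary(99), is_mapped_primary(355))
```

Flag 355 is 99 with the secondary bit (256) added, so it decodes to the same labels plus `secondary` and fails the primary test.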

HISAT2 is a good mapper for most use cases. See the advanced options, especially Reporting options >> Primary alignment. Secondary alignments are already filtered out by default (but that can be adjusted).
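
One more possible route for the multi-mapped question, sketched in Python under an assumption: that your alignment records carry the optional `NH:i:` tag (number of reported alignments for the read, as defined in the SAM optional-fields specification). Check a few lines of your own output first; if the tag is absent this approach does not apply. The function names and the example record are hypothetical.

```python
def hit_count(sam_line):
    """Return the NH:i value of a SAM alignment line, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NH:i:"):
            return int(tag[len("NH:i:"):])
    return None


def is_uniquely_mapped(sam_line):
    """True only when the record reports exactly one alignment for the read."""
    return hit_count(sam_line) == 1


# Made-up records for illustration only.
unique = "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII\tNH:i:1\n"
multi = "r2\t99\tchr1\t100\t1\t4M\t=\t300\t204\tACGT\tIIII\tNH:i:3\n"
print(is_uniquely_mapped(unique), is_uniquely_mapped(multi))
```

Records without the tag return `None` from `hit_count` and are treated as not uniquely mapped, which is the conservative choice.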

modified 8 weeks ago • written 8 weeks ago by Jennifer Hillman Jackson