Error with SAM to BAM conversion

2.0 years ago by

United Kingdom

Hi Dannon

This is the process I followed:

1) I generated a SAM file by mapping my FASTQ files against two different reference sequences simultaneously (named RS16000389_V3_Ref_1 and Ref_2), then removed unmapped reads and secondary/supplementary alignments.
2) I used the tool "Filter data on any column using simple expressions" to generate two different SAM files, one with the reads mapping to each reference sequence.
3) I compared these two SAM files to find the reads unique to each file (since I am not interested in the reads mapping to both reference sequences).
4) I converted SAM to BAM, but it only worked for Reference 1.

I have checked the Reference 1 SAM file and I have found these expressions in the column OPT for all the reads:

NM:i:0 MD:Z:151 AS:i:151 XS:i:0
NM:i:0 MD:Z:151 AS:i:151 XS:i:113

However, I have found the following expressions very often in the Reference 2 SAM file:

NM:i:1 MD:Z:1A90 AS:i:90 XS:i:82 XA:Z:RS16000389_V3_Ref_1,+531,92M,2;
NM:i:1 MD:Z:1A90 AS:i:90 XS:i:82 XA:Z:RS16000389_V3_Ref_1,-531,92M,2;

and sometimes these:

NM:i:0 MD:Z:66 AS:i:66 XS:i:37
NM:i:0 MD:Z:66 AS:i:66 XS:i:37

I guess "XA:Z:RS16000389_V3_Ref_1,+531,92M,2;" means that the read also matches 92 nucleotides of Reference 1, but what do the other parameters mean?
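For readers decoding these fields: in the SAM optional-field conventions, NM is the edit distance to the reference, MD is the string of mismatching positions, and AS is the alignment score. XS and XA are aligner-specific; assuming the aligner was BWA (the XA layout above matches its format), XS is the score of the best suboptimal alignment and XA lists alternative hits as `reference,strand+position,CIGAR,NM;`. Under that assumption, a minimal Python sketch to split such a tag (the function name `parse_xa` is hypothetical, not part of any tool):

```python
def parse_xa(tag):
    """Parse a BWA-style XA:Z tag into a list of alternative hits.

    Assumes each hit is written as 'ref,<strand><pos>,CIGAR,NM;'.
    """
    prefix = "XA:Z:"
    if not tag.startswith(prefix):
        raise ValueError("not an XA:Z tag")
    hits = []
    for entry in tag[len(prefix):].split(";"):
        if not entry:  # skip the empty piece after the trailing ';'
            continue
        # rsplit from the right so a reference name containing commas survives
        ref, pos, cigar, nm = entry.rsplit(",", 3)
        hits.append({
            "ref": ref,           # reference the read also aligns to
            "strand": pos[0],     # '+' or '-'
            "pos": int(pos[1:]),  # 1-based leftmost position
            "cigar": cigar,       # e.g. 92M = 92 aligned bases
            "nm": int(nm),        # edit distance of this alternative hit
        })
    return hits


hits = parse_xa("XA:Z:RS16000389_V3_Ref_1,+531,92M,2;")
print(hits)
```

Read this way, the example tag says the read has an alternative alignment on RS16000389_V3_Ref_1, forward strand, starting at position 531, spanning 92 aligned bases with an edit distance of 2.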

Is this the reason I cannot generate a BAM file for Reference 2?

Is there any way of filtering unique reads for each reference sequence?

Thank you for your help

Juan

written 2.0 years ago by Juan Ledesma

Hi Juan, Do you retain the SAM header after you use the Filter tool?
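To illustrate why the header matters: a filter that only tests a column value can silently drop the `@` header lines, and SAM-to-BAM conversion typically needs the `@SQ` lines to know the reference names and lengths. A minimal Python sketch, assuming plain-text SAM lines; the function names and the toy records (including the `LN` values) are made up for illustration:

```python
def filter_sam_by_reference(lines, ref_name):
    """Yield every header line plus alignments whose RNAME equals ref_name."""
    for line in lines:
        if line.startswith("@"):
            yield line  # keep the header untouched
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 (index 2) of a SAM alignment line is the reference name
        if len(fields) > 2 and fields[2] == ref_name:
            yield line


def has_sq_header(lines):
    """Return True if at least one @SQ header line is present."""
    return any(line.startswith("@SQ") for line in lines)


# Toy SAM content (hypothetical values, not taken from the real dataset)
sam = [
    "@SQ\tSN:Ref_1\tLN:1000\n",
    "@SQ\tSN:Ref_2\tLN:1000\n",
    "read1\t0\tRef_1\t531\t60\t92M\t*\t0\t0\t*\t*\n",
    "read2\t0\tRef_2\t10\t60\t66M\t*\t0\t0\t*\t*\n",
]

kept = list(filter_sam_by_reference(sam, "Ref_2"))
print(has_sq_header(kept))
```

If the same check on your filtered Reference 2 file reports no `@SQ` lines, the missing header is a likely cause of the failed conversion.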

It seems like you are getting reads mapping partially to both reference sequences, which may be an issue. You could try aligning your data in two separate runs (one for each reference) and comparing the outputs based on the read ID with the 'Compare Two Datasets' tool to get uniquely mapping reads.
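The comparison by read ID amounts to a set difference on the first SAM column (QNAME). A minimal Python sketch of that idea, not the 'Compare Two Datasets' tool itself; the function names and the toy read IDs are hypothetical:

```python
def read_ids(sam_lines):
    """Collect the QNAME (first column) of every alignment line."""
    return {
        line.split("\t", 1)[0]
        for line in sam_lines
        if line.strip() and not line.startswith("@")
    }


def unique_read_ids(ids_a, ids_b):
    """Return (IDs only in A, IDs only in B)."""
    a, b = set(ids_a), set(ids_b)
    return a - b, b - a


# Toy alignments against each reference (made-up records)
ref1_sam = [
    "@SQ\tSN:Ref_1\tLN:1000\n",
    "read1\t0\tRef_1\t1\t60\t10M\t*\t0\t0\t*\t*\n",
    "read2\t0\tRef_1\t5\t60\t10M\t*\t0\t0\t*\t*\n",
]
ref2_sam = [
    "@SQ\tSN:Ref_2\tLN:1000\n",
    "read2\t0\tRef_2\t7\t60\t10M\t*\t0\t0\t*\t*\n",
    "read3\t0\tRef_2\t9\t60\t10M\t*\t0\t0\t*\t*\n",
]

only_ref1, only_ref2 = unique_read_ids(read_ids(ref1_sam), read_ids(ref2_sam))
print(sorted(only_ref1), sorted(only_ref2))
```

Because both mates of a pair share one QNAME, this treats a pair as shared if either mate aligns to both references.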

written 2.0 years ago by Mo Heydarian

Hi Mo, I have tried the tool that you suggested and it seems that I get uniquely mapping reads. Thank you. However, I think it will be very difficult to use this approach to analyse viral quasispecies or closely related viral populations in the same sample using Galaxy.

written 2.0 years ago by Juan Ledesma

Hi Juan, That is great to hear.

Feel free to expand on how performing your analysis will be difficult within Galaxy. We value the feedback of our users.

If you are concerned with having to manually launch each alignment job for an individual reference, don't be. If you have ten reference sequences and one FASTQ file you would like to have aligned, you can launch the one FASTQ file and all ten reference sequences from one tool form by entering your reference sequences in batch mode (the middle button to the left of the input box). This will launch one alignment job per reference provided. Once the jobs have run you can capture this (potentially complex) array of alignment jobs in a workflow by using the 'Extract workflow' feature in the history menu.

You could use a combination of tools from here to resolve the reads that align uniquely to each reference provided. Here is an example workflow (use this link to import the workflow to your Galaxy): https://raw.githubusercontent.com/MoHeydarian/Workshed/master/Galaxy-Workflow-2016.11.17_Alignment_and_resolution_of_uniquely_mapped_reads_from_viral_populations.ga

Hope this is helpful!

Cheers,

Mo Heydarian, Galaxy Team

written 2.0 years ago by Mo Heydarian
