BOWTIE BAM file alignments


3.9 years ago by

United States

rooneyl • 0 wrote:

I prepared a small RNA (28-34 nt) library made up of ribosome-protected RNA fragments. I prepared the library for Illumina sequencing by RT-PCR and adaptor ligation.

Results were excellent, with ~3-4M reads for each of the six indexed data sets and very high QC results.

After adaptor trimming and size selection of reads >28 nt, Bowtie alignment of my reads to a rabbit genome (oryCun2) gives me 53% alignment, with 51% aligned more than once. This sample is a mixture of rabbit and human, as it is a cell-free translation reaction of huCFTR. Perhaps some of the alignments are missed at the intron-exon boundaries.

If I display these BAM files at UCSC Main, will I be looking at all possible alignments or a single best alignment for each read?


3.9 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

It depends on how you set up the parameters for the mapping. By default, all hits that pass the mapping criteria about equally well will be retained.

If you just want the best hits, you can always filter with SAMtools on the mapping flags, according to the criteria you set.
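To illustrate what filtering on mapping flags means, here is a minimal Python sketch that reads SAM-format text lines and keeps only mapped records that are not marked secondary or supplementary. The flag bit values come from the SAM specification; the function names and the example records are hypothetical. Note that whether an aligner actually sets the secondary bit for extra hits depends on the tool and its settings, so this is a sketch under that assumption, not a guaranteed "best hit" filter.

```python
# SAM FLAG bits as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def is_primary_mapped(flag: int) -> bool:
    """True for a mapped record that is neither secondary nor supplementary."""
    excluded = FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY
    return (flag & excluded) == 0


def filter_primary(sam_lines):
    """Yield header lines unchanged, plus primary mapped alignment lines."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if is_primary_mapped(flag):
            yield line


# Hypothetical records (read names, positions and sequences are made up).
EXAMPLE_SAM = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tchr1\t100\t30\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read1\t256\tchr2\t500\t0\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGTT\tIIIII",
]

kept = list(filter_primary(EXAMPLE_SAM))
```

With SAMtools itself, the same exclusion can be expressed as `samtools view -b -F 0x904 in.bam > out.bam`, where `in.bam` and `out.bam` are placeholder file names and 0x904 is the sum of the three flag bits above.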

Hopefully this helps, Jen, Galaxy team


Hi Jennifer, yes, that helps. Can you tell me what a good minimum MAPQ score would be for a perfect match? Is the screenshot attached the correct location to do this? Can I leave the other categories blank? Is there another step after running SAMtools? LeeAnn

written 3.9 years ago by rooneyl • 0

Hello,

I like this summary for reference:
http://maq.sourceforge.net/qual.shtml

A set of best hits that score nearly the same and a single best hit per read are different things, and as far as I know it is not possible to tease these apart for large short-read datasets. This is why some software tools specifically avoid simple, absolute abundance counts. It is also true that a read may match slightly (but not significantly) better to a region that it was not biologically derived from. Retaining all hits above a MAPQ threshold (30 is reasonable) is often the best path. For exact matches you can go higher, but you will lose content and risk biasing the results.
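For a sense of what a threshold of 30 means: MAPQ is Phred-scaled, so per the SAM specification it equals -10 log10 of the probability that the mapping position is wrong. The sketch below converts a MAPQ value to that probability and checks one SAM line against a minimum. The function names are hypothetical, and treating the reserved value 255 ("mapping quality not available") as failing the filter is an assumption of this sketch, not a rule. Aligners differ in how informative their MAPQ values are, so it is worth looking at the distribution in your own file first.

```python
MAPQ_UNAVAILABLE = 255  # reserved by the SAM specification for "not available"


def mapq_to_error_prob(mapq: int) -> float:
    """Convert a Phred-scaled MAPQ to the probability the position is wrong."""
    return 10 ** (-mapq / 10)


def passes_mapq(sam_line: str, min_mapq: int = 30) -> bool:
    """True if an alignment line has a usable MAPQ at or above min_mapq.

    Assumption of this sketch: MAPQ 255 (unavailable) is treated as failing.
    """
    mapq = int(sam_line.split("\t")[4])  # MAPQ is the fifth SAM column
    return mapq != MAPQ_UNAVAILABLE and mapq >= min_mapq


# Hypothetical records (names, positions and sequences are made up).
GOOD = "read1\t0\tchr1\t100\t37\t5M\t*\t0\t0\tACGTA\tIIIII"
WEAK = "read2\t0\tchr1\t200\t3\t5M\t*\t0\t0\tACGTA\tIIIII"

# MAPQ 30 corresponds to roughly a 1-in-1000 chance of a wrong position.
p30 = mapq_to_error_prob(30)
```

The SAMtools equivalent of the threshold is `samtools view -b -q 30 in.bam > out.bam`, with placeholder file names; `-q` keeps records whose MAPQ is at least the given value.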

That said, you will almost certainly have some reads that map with identical quality to multiple locations. The shorter the read, the more often this will occur. All hits above the threshold you set should probably be considered valid, but in downstream steps you can always ignore hits to pseudogenes, repetitive regions, and the like by using a reference annotation dataset that excludes them.

Paired-end data helps to reduce the noise (properly paired mapped reads add a further layer of confidence), but there can still be some cross-over hits. It is probably more important to look at the coverage for each region of interest toward the end of the analysis than to worry about multi-mapping reads at the beginning.
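One way to see how common multi-mapping is in a given dataset is to count alignment records per read name. The sketch below does this over SAM-format text lines. It assumes single-end reads and assumes the aligner wrote one record per reported location; with paired-end data both mates share a read name, so the count would need adjusting. All names and example records are hypothetical.

```python
from collections import Counter


def hits_per_read(sam_lines):
    """Count mapped alignment records per read name (QNAME)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.split("\t")
        if int(fields[1]) & 0x4:  # skip unmapped records
            continue
        counts[fields[0]] += 1
    return counts


def multimapped_fraction(counts):
    """Fraction of mapped reads that have more than one alignment record."""
    if not counts:
        return 0.0
    multi = sum(1 for n in counts.values() if n > 1)
    return multi / len(counts)


# Hypothetical records: read1 maps twice, read2 once, read3 is unmapped.
EXAMPLE_HITS = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t0\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read1\t256\tchr1\t700\t0\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t16\tchr1\t300\t37\t5M\t*\t0\t0\tTTGCA\tIIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGTT\tIIIII",
]

counts = hits_per_read(EXAMPLE_HITS)
fraction = multimapped_fraction(counts)
```

A high fraction here is expected for short reads and is not by itself a sign that the mapping went wrong.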

I realize this does not give you exactly the answer that you want, but perhaps explains the data a bit more.

Best, Jen, Galaxy team

written 3.9 years ago by Jennifer Hillman Jackson ♦ 25k
