Question: BOWTIE BAM file alignments
rooneyl (United States) wrote, 3.9 years ago:

I prepared a small RNA (28-34 nt) library made up of ribosome-protected RNA fragments. I prepared the library for Illumina sequencing by RT-PCR and adaptor ligation.

Results were excellent: ~3-4M reads for each of the six indexed data sets, with very high QC results.

After adaptor trimming and size selection (>28 nt), Bowtie alignment of my reads to a rabbit genome (oryCun2) gives me 53% alignment, with 51% aligned >1 time. This sample is a mixture of rabbit and human, as it is a cell-free translation reaction of huCFTR. Perhaps some of the alignments are missed at the intron-exon boundaries.

If I display these BAM files on UCSC main, will I be looking at all possible alignments or a single best alignment for each read?

Jennifer Hillman Jackson (United States) wrote, 3.9 years ago:

Hello,

It depends on how you set up the mapping parameters. By default, all hits that fall within the same range of passing the mapping criteria are retained.

If you just want the best hits, you can always filter with SAMtools by mapping flags, according to the criteria you set.
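
As a rough illustration of this kind of filter, here is a minimal Python sketch that works on plain-text SAM lines (for example, the text output of `samtools view`). The function name `filter_sam_lines` and its default threshold are assumptions made for this example, not part of SAMtools. The column positions and flag bits (0x4 for unmapped, 0x100 for secondary) follow the SAM format.

```python
UNMAPPED = 0x4     # SAM flag bit: read is unmapped
SECONDARY = 0x100  # SAM flag bit: secondary (not primary) alignment


def filter_sam_lines(lines, min_mapq=30, drop_secondary=True):
    """Yield SAM lines that pass a simple flag and MAPQ filter.

    Hypothetical helper for illustration only. Header lines
    (starting with '@') are passed through unchanged.
    """
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2: bitwise FLAG
        mapq = int(fields[4])  # column 5: MAPQ
        if flag & UNMAPPED:
            continue  # skip reads that did not align
        if drop_secondary and flag & SECONDARY:
            continue  # keep only primary alignments
        if mapq < min_mapq:
            continue  # skip low-confidence placements
        yield line
```

On the command line, `samtools view -q 30` applies a comparable MAPQ cut-off, and `-F` excludes records by flag bits.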

Hopefully this helps, Jen, Galaxy team


Hi Jennifer, Yes, that helps. Can you tell me what a good minimum MAPQ quality score would be for a perfect match? Is the attached screenshot the correct place to do this? Can I leave the other categories blank? Is there another step after running this SAMtools filter? LeeAnn

written 3.9 years ago by rooneyl

Hello,

I like this summary for reference:
http://maq.sourceforge.net/qual.shtml

Best hits that are nearly the same and a single best hit per read are different things, and teasing these apart is not possible for large short-read datasets, as far as I know. This is why some software tools specifically avoid simple, absolute abundance counts. It is also true that you may get a slightly stronger, but not significantly stronger, match to a region that the read was not biologically derived from. Retaining all hits above a threshold (30 is reasonable) is often the best path. For exact matches, you can go higher, but you will lose content and risk biasing the results.
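
To put the threshold in perspective, the reference linked above defines mapping quality on a Phred scale: Q = -10 log10(probability that the read is wrongly mapped). The small Python sketch below converts between the two, assuming the aligner reports MAPQ on that scale (not every aligner does, and in SAM a value of 255 means the quality is unavailable). The function names are made up for this example.

```python
import math


def mapq_to_error_probability(mapq):
    """Probability that the reported position is wrong, for a Phred-scaled MAPQ."""
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p_wrong):
    """Phred-scaled mapping quality for a given probability of a wrong placement."""
    return -10 * math.log10(p_wrong)


# Print the implied error probability for a few example thresholds.
for q in (10, 20, 30, 40):
    print(q, mapq_to_error_probability(q))
```

Under that assumption, a cut-off of 30 keeps alignments with roughly a 1 in 1000 chance of being misplaced.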

That said, you will almost certainly have some reads that will map with identical quality to multiple locations. The shorter the read, the more often this will occur. All above the threshold you set should probably be considered valid, but you can always ignore hits to pseudogenes or repetitive regions and such by using a reference annotation dataset that excludes these in downstream steps. 
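
To get a feel for how many reads are affected, a sketch like the following counts reported alignments per read name from plain-text SAM lines. It assumes single-end reads and that the aligner was set to report more than one hit per read, each on its own line. The function names are hypothetical.

```python
from collections import Counter


def count_alignments_per_read(lines):
    """Count mapped SAM records per read name (column 1)."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.split("\t")
        if int(fields[1]) & 0x4:
            continue  # skip unmapped reads
        counts[fields[0]] += 1
    return counts


def multi_mapped_reads(lines):
    """Return the set of read names reported at more than one location."""
    counts = count_alignments_per_read(lines)
    return {name for name, n in counts.items() if n > 1}
```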

Paired-end data helps to reduce the noise (properly paired mapped reads add a layer of confidence), but there can still be some cross-over hits. It is probably more important to look at the coverage for each region of interest toward the end of the analysis than to worry about multi-mapping reads at the beginning.
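
As a simple picture of what "coverage for a region of interest" means, the sketch below computes mean per-base depth over a region from a list of alignment intervals. It assumes 0-based, half-open coordinates and ungapped alignments (no spliced reads). The function name and the tuple layout are illustrative assumptions, not any tool's format.

```python
def mean_coverage(alignments, chrom, start, end):
    """Mean per-base depth over [start, end) on one chromosome.

    alignments: iterable of (chrom, start, end) tuples, 0-based, half-open.
    """
    if end <= start:
        raise ValueError("region must have positive length")
    covered_bases = 0
    for a_chrom, a_start, a_end in alignments:
        if a_chrom != chrom:
            continue
        # Number of bases this alignment shares with the region.
        overlap = min(a_end, end) - max(a_start, start)
        if overlap > 0:
            covered_bases += overlap
    return covered_bases / (end - start)
```

For example, two 30 nt reads that each overlap a 30-base region by 20 bases give a mean depth of 40/30, about 1.3.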

I realize this does not give you exactly the answer that you want, but perhaps it explains the data a bit more.

Best, Jen, Galaxy team

written 3.9 years ago by Jennifer Hillman Jackson