Count intervals of non-uniquely mapped reads overlapping the genome

Question: Count intervals of non-uniquely mapped reads overlapping the genome

3.9 years ago by

United States

Hello everyone,

I have some SAM/BAM files containing the alignments of small RNA-seq reads to mm9 that are created by BWA. I'm interested in calculating where they are mapped to in the genome. I noticed that there are a lot of reads that are mapped to multiple loci (multi-mappers) in genome. Therefore, I first separated the unique-mappers from multi-mappers and counted intervals of unique-mappers overlapping different regions of the genome using the bedtool in the galaxy. Now here comes the issue: as multi-mappers are mapped non-uniquely to various regions in the genome, if I simply use the same method as the uniquely-mappers, I will overestimate the number of counts, how can I count the normalized intervals of multi-mappers overlapping different regions of the genome? Is there any tool in the galaxy or R package that I can use? Thank you for your help in advance!

bed galaxy bam • 2.3k views

ADD COMMENT • link •

modified 3.9 years ago • written 3.9 years ago by sunkissbean • 0

The procedure you need generally depends on the biology you want to explore - eg http://www.rna-seqblog.com/small-rna-read-alignment-for-accurate-quantification/ shows that different levels of mapping stringency give results with different uses - there's no one-size-fits-all approach but low multiplicity seems useful for t-rna whereas unique mappings are good if you only care about microRNA.

I don't think there are any specific tools in Galaxy for this kind of exploration but if you can find a working R script it can probably be easily turned into a Galaxy tool - eg using the tool factory.

ADD REPLY • link written 3.9 years ago by fubar ♦ 1.1k

3.9 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

For RNA-seq data, BWA is probably not the optimal mapping tool, if working with a genome that contains spliced transcripts. Tophat/2 would be a better choice.

Have you examined the Tuxedo suite to see if it will meet your analysis needs? Cufflinks may provide you with the desired results. See the tool group "NGS: RNA-seq". Help for the tool package is here:
Section 2.13.2 http://wiki.galaxyproject.org/Support

Best, Jen, Galaxy team

ADD COMMENT • link written 3.9 years ago by Jennifer Hillman Jackson ♦ 25k

3.9 years ago by

sunkissbean • 0

United States

sunkissbean • 0 wrote:

Thank you for your reply, Jen! Mine is 50-bp single-end sequences from small RNAs rather than RNA-seq. I mapped my sequences to mm9 using BWA and more than 80% of the reads can be mapped to the genome while the non-mappers cannot be mapped to the genome using Tophat2. The reason why I didn't use the Tuxedo suite is because for quantification of small RNA reads, RPKM is usually not a good option due to their small size (20-30 nt).

ADD COMMENT • link written 3.9 years ago by sunkissbean • 0

Sorry, I didn't read the question carefully enough! Ross's advice above is good - and these tools are not on the public Main Galaxy instance. There are some other wrapped tools for Galaxy that may be a fit. These are for use in a cloud Galaxy. See the Tool Shed. Thanks! Jen, Galaxy Team

http://usegalaxy.org/toolshed
http://usegalaxy.org/cloud

ADD REPLY • link written 3.9 years ago by Jennifer Hillman Jackson ♦ 25k

3.9 years ago by

sunkissbean • 0

United States

sunkissbean • 0 wrote:

Thank you all for your suggestions and advices! The link really shows the importance of dealing with the multi-mappers for accurate quantification of small RNA seq. I wonder whether I can use RSEM or eXpress, both of which can be found in the tool shed and seem to give more accurate quantification of multi-mappers by normalizing the number of times that they are mapped to the genome.

ADD COMMENT • link written 3.9 years ago by sunkissbean • 0

3.9 years ago by

sunkissbean • 0

United States

sunkissbean • 0 wrote:

Update: I tried RSEM on the cloud but somehow running the RSEM but always got an error message reading "tool error". I feel that the error probably came from when I prepared the reference as the reference file that I generated was very small (<1 MB). Specifically, I used the "RSEM prepare reference" function to generate the reference from RefSeq DNA fasta file. Then, I used the RSEM calculate expression to quantitate the expression level of small RNAs by using a SAM/BAM file that was generated by BWA and the reference file. I am really confused what went wrong now.

ADD COMMENT • link written 3.9 years ago by sunkissbean • 0

Please log in to add an answer.

Similar posts • Search »