Extension Of Read Length

Question: Extension Of Read Length

5.2 years ago by

Japan

Dear all,

I am working on an MNase-Seq experiment with 50 bp single-end reads. To identify nucleosome positions, I have read that the single reads need to be extended to roughly the length of nucleosome-protected DNA, approximately 150 bp. Is there a way in Galaxy to extend 50 bp reads to 150 bp, let's say starting from a BAM file of mapped reads? Of course, any other comments on this topic are much appreciated!

Thank you very much,
Tobias

--
Tobias Hohenauer, PhD
GCNA, Disease Mechanism Research Core
RIKEN Brain Science Institute
2-1 Hirosawa, Wako-shi 351-0198 Japan

galaxy • 1.3k views

modified 5.2 years ago by Jennifer Hillman Jackson ♦ 25k • written 5.2 years ago by Tobias Hohenauer • 20

5.2 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hi Tobias,

In general, you can use 'NGS: Picard (beta) -> SAM to FASTQ' to extract sequences (convert BAM -> SAM first), but this tool does not add extra sequence from the reference genome (or pad the associated quality scores, etc.). I don't know of a Galaxy-wrapped tool that does this, but you might check the Tool Shed or other public Galaxy servers. Others reading this post may also have advice.

Going from BAM -> coordinates (BED/interval) -> FASTA sequence is possible in a few ways. The general idea is that the coordinates are manipulated to extend the mapped footprint, and then the sequence is extracted from the reference genome. Any novel content in the original reads is lost, since the output is reference sequence, but maybe this still has some utility for you.

The two methods below show how to do this; the 2nd is simpler if the genome is at UCSC. There are other ways to get flanking sequence, merge/cluster, etc. (see the tools in the group 'Operate on Genomic Intervals'), but these are the most direct per-sequence methods to simply extend. If you need to filter out multi-mapped data, use the tool 'NGS: SAM Tools -> Filter SAM' (converting between SAM and BAM as needed).

*1st method, works for any genome, including a custom reference genome:*

1 - convert BAM to SAM with 'NGS: SAM Tools -> BAM-to-SAM'
2 - convert SAM to interval with 'NGS: SAM Tools -> Convert SAM', or convert to BED with 'BEDTools -> Convert from BAM to BED'
3 - split the file into two, one with the (+) strand alignments and one with the (-) strand alignments, using the tool 'Filter and Sort -> Filter'
4 - adjust the start or end coordinate to extend the alignment footprint as wanted, using the tool 'Text Manipulation -> Compute'. Remember that for negative-strand alignments, the "start" is really where the end of the sequence aligned and the "end" is where the start of the sequence aligned: interval files report coordinates with respect to the (+) strand, smallest -> largest. See http://wiki.galaxyproject.org/Learn/Datatypes#Interval
5 - cut out the columns to create a standard interval file again, swapping in the new coordinates. Click on the pencil icon to assign the column attributes and a reference genome as needed; this information is required by the next tool.
6 - get the FASTA sequence with the tool 'Fetch Sequences -> Extract Genomic DNA'
7 - merge all FASTA results together with the tool 'Text Manipulation -> Concatenate datasets'
8 - if you need FASTQ format, you can pad out quality scores and create it with the tool 'NGS: QC and manipulation -> Combine FASTA and QUAL'

*2nd method, if the reference genome is at UCSC:*

1 - convert BAM to BED with 'BEDTools -> Convert from BAM to BED'
2 - click on the "view at UCSC main" link for the dataset
3 - once at the UCSC Browser, the data will show up as a custom track, by default named "User Track", in the top track group. Click on the track name; this takes you to the track controls and focuses the browser on this track.
4 - in the top blue menu bar, click on "Tools -> Table Browser". The track will be pre-loaded in the form with all options probably set as you want them (this user track is selected and "region" is "genome"), except for one: change "output format" from "BED" to "sequence"
5 - confirm that the "Galaxy" box is checked, and click on "get output"
6 - the next form has options for extending the sequence at the 5' and/or 3' ends, all in one go; adjust as you want
7 - click on "Send query to Galaxy" and the dataset will load back into the working history
8 - the FASTA can be converted to FASTQ as in the 1st method, step 8

Hopefully some of this is helpful!

Jen
Galaxy team

--
Jennifer Hillman-Jackson
http://galaxyproject.org
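
For readers who want to check the arithmetic behind step 4 of the 1st method, the strand-aware coordinate adjustment can be sketched in a few lines of Python. This is only an illustration, not a Galaxy tool: the function name `extend_interval` and its parameters are hypothetical, and it assumes BED-style coordinates (0-based start, exclusive end, reported on the (+) strand) and a target length of 150 bp as mentioned in the question.

```python
def extend_interval(start, end, strand, target_len=150, chrom_size=None):
    """Extend a BED-style interval (0-based start, exclusive end) to target_len.

    The read is extended in its own 5' -> 3' direction:
      (+) strand: the start is the 5' end, so the end coordinate moves right.
      (-) strand: the end is the 5' end, so the start coordinate moves left.
    Intervals already at least target_len long are left unchanged.
    """
    if strand == "+":
        new_start, new_end = start, max(end, start + target_len)
    elif strand == "-":
        new_start, new_end = min(start, end - target_len), end
    else:
        raise ValueError("strand must be '+' or '-', got %r" % (strand,))

    # Clamp to the chromosome boundaries so no coordinate falls off either end.
    new_start = max(0, new_start)
    if chrom_size is not None:
        new_end = min(new_end, chrom_size)
    return new_start, new_end


# A 50 bp (+) strand read grows to the right, a (-) strand read to the left.
print(extend_interval(100, 150, "+"))    # (100, 250)
print(extend_interval(1000, 1050, "-"))  # (900, 1050)
```

In the Galaxy 'Compute' tool the same logic becomes one expression per strand-specific file, for example start + 150 for the (+) file and end - 150 for the (-) file; negative results near chromosome starts would still need to be filtered or set to 0 afterwards.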
