Question: How To Use Collapsed Sequence Files In Mapping And Displaying
6.6 years ago, Jun Lu wrote:
I found that there is a "collapse" tool under FASTA manipulation. It significantly shortens Bowtie mapping time for small RNA data, which tends to contain many reads of identical length and sequence after adaptor clipping. The tool gives each unique sequence a new name that includes the number of times (occurrences) that sequence appeared in the data.

The question is: after mapping with Bowtie, how can I recover this "occurrence" information when displaying in Genome Browser? With the current settings, only one mapped read is shown for each unique sequence, no matter how many times that sequence occurred. Should I write custom code to expand the resulting SAM file based on the occurrences? All runs were executed on the Galaxy main server.

Any suggestion is appreciated.

Jun
6.6 years ago, Jennifer Hillman Jackson wrote:
Hi Jun,

There isn't an automatic way to interpret this 'count' number from the sequence identifier when visualizing a BAM/SAM file, but it can be done in a BED file with some text manipulation.

Note: BED data does not contain sequence data (BAM/SAM data does). This is something to be aware of when planning visualization priorities. If you want to zoom to the nucleotide/sequence level and see sequence data in your track, then this method is probably not the right choice.

If you do choose to do this, first convert BAM/SAM to Interval. The count can then be placed into the 'score' attribute of a BED dataset. BED data displays at UCSC in shades of grey based on score values. See column #5 "score" here: http://genome.ucsc.edu/FAQ/FAQformat.html#format1

The basic idea is to use tools from the tool group "Text Manipulation" to manipulate the data. The general path would be the following (tune as needed), starting with an Interval file:

- Parse out the count data from the sequence name with "Convert delimiters to TAB" by "Dashes" to isolate the count from the latter half of the first column (sequence identifier). This new column of data will become your "score" column.
- Optional. You may want/need to perform a calculation on the 'score' value to make it fit the 0-1000 grey scale that UCSC offers. To do so, use "Compute" and your own scaling expression.
- "Add column" to create a "name" column. A "." (dot) works as a NULL value.
- "Cut" columns to create a BED format of 6 columns in the proper order: http://wiki.g2.bx.psu.edu/Learn/Datatypes#Bed
- Click on the pencil icon to 'Edit Attributes', set the datatype to ".bed", and save. Then set/double check all 6 attributes and save. Finally, set the database if it becomes unassigned during processing.

Best wishes for your project,

Jen
Galaxy team

--
Jennifer Jackson
http://galaxyproject.org
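For anyone who prefers to prototype the same path outside Galaxy, the steps above can be sketched roughly in Python. This is only an illustration under stated assumptions, not part of the Galaxy tools: the function names are hypothetical, the input rows are assumed to be (chrom, start, end, strand, read_id) tuples, the collapsed read names are assumed to look like "rank-count" (for example "12-3456"), and the log scaling is just one possible "scaling expression" for fitting counts into the 0-1000 range.

```python
import math


def parse_count(read_id):
    """Return the occurrence count from a collapsed read ID such as '12-3456'.

    Assumption: the count is the part after the last dash.
    """
    return int(read_id.rsplit("-", 1)[1])


def scale_score(count, max_count):
    """Log-scale a count into the 0-1000 BED score range (one possible choice)."""
    if max_count <= 1:
        return 1000
    scaled = 1000 * math.log(count + 1) / math.log(max_count + 1)
    return min(1000, round(scaled))


def interval_to_bed6(rows):
    """Convert (chrom, start, end, strand, read_id) rows into BED6 lines.

    Assumption: this column layout is hypothetical; adjust the unpacking
    to match the columns of your own Interval dataset.
    """
    rows = list(rows)
    if not rows:
        return []
    max_count = max(parse_count(r[4]) for r in rows)
    bed_lines = []
    for chrom, start, end, strand, read_id in rows:
        score = scale_score(parse_count(read_id), max_count)
        # BED6 column order: chrom, start, end, name, score, strand.
        # "." stands in for the name, as suggested above.
        fields = [chrom, str(start), str(end), ".", str(score), strand]
        bed_lines.append("\t".join(fields))
    return bed_lines


if __name__ == "__main__":
    example = [
        ("chr1", 100, 122, "+", "1-999"),
        ("chr1", 500, 521, "-", "2-1"),
    ]
    for line in interval_to_bed6(example):
        print(line)
```

Log scaling is used here because small RNA counts often span several orders of magnitude, so a linear scale would leave most reads nearly white in the track; a different expression may suit your data better.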