Question: how to sort reads by read length
23 days ago, c.zijlstra wrote:

I would like to sort the reads in a BAM file by read length. Which tool on the Galaxy website can I use?

sort sam galaxy bam • 47 views
23 days ago, Jennifer Hillman Jackson (United States) wrote:

Hello,

In Galaxy, the bam datatype means something specific with respect to format and content: a coordinate-sorted BAM dataset. Other BAM datatypes are available, but most tools (if not all) restrict their inputs to datasets with the bam datatype assigned. This avoids many usage errors and tool failures caused by unexpected sorting.

BAM datatypes are described in the 18.01 release notes; scroll down to the section named "New BAM datatypes": https://docs.galaxyproject.org/en/master/releases/18.01_announce.html

There isn't a tool that sorts by read length. Coordinate-sorted BAM is the input format that most tools expect. The few that require queryname-sorted BAMs have an option on the tool form to queryname sort the data for processing (the original input is still expected to be coordinate sorted, with the datatype bam assigned). Assigning the bam datatype to data that is not coordinate sorted will result in an error or a warning; if it is only a warning, expect downstream tools to fail when using that input.
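To see how a dataset declares its sort order once it has been converted to SAM text, here is a minimal Python sketch that reads the SO tag from the @HD header line. The function name and the sample header are hypothetical, and this only reports what the header claims; it is not how Galaxy itself validates the bam datatype.

```python
def sam_sort_order(header_lines):
    """Return the SO value from the @HD header line, or None if it is absent."""
    for line in header_lines:
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs after the record type.
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


# Hypothetical header lines, for illustration only.
header = ["@HD\tVN:1.6\tSO:coordinate", "@SQ\tSN:chr1\tLN:248956422"]
print(sam_sort_order(header))
```

A header without an @HD line, or without an SO tag, gives None, which means the sort order is simply not declared.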

Options:

  • BAM data can be sorted by either queryname or start coordinate with the tool SortSam (sort SAM/BAM dataset), Galaxy Version 2.18.2.1.
  • BAM data can be filtered by read length with the tool BAM filter (Removes reads from a BAM file based on criteria), Galaxy Version 0.5.9.
  • A basic summary of read lengths can be generated with the tool FastQC (Read Quality reports), Galaxy Version 0.72. Note that this is based on a sample of the first 200k sequences or so, not the complete dataset.
  • If you want to do something else with the data, it can be converted to interval format and manipulated from there. The steps would involve a workflow such as: BAM-to-SAM > Convert SAM to interval > Compute (subtract start from end for read length) > Sort data in ascending or descending order on the new length column.
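Outside Galaxy, the Compute and Sort steps of that last workflow can be sketched in a few lines of Python. This assumes tab-separated interval rows with chromosome, start, and end in the first three columns; the function name and sample rows are made up for illustration. Keep in mind that end minus start is the span the read covers on the reference, which matches the read length only when the alignment has no indels, clipping, or splicing.

```python
def sort_intervals_by_length(rows, descending=False):
    """Append an end-minus-start length column to each row, then sort on it."""
    with_length = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        # Assumed column layout: chrom, start, end, then anything else.
        length = int(fields[2]) - int(fields[1])
        with_length.append(fields + [str(length)])
    return sorted(with_length, key=lambda f: int(f[-1]), reverse=descending)


# Hypothetical interval rows, for illustration only.
rows = [
    "chr1\t100\t150\tread_a",
    "chr1\t200\t236\tread_b",
    "chr2\t50\t125\tread_c",
]
for fields in sort_intervals_by_length(rows):
    print("\t".join(fields))
```

Passing descending=True puts the longest intervals first.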

There are a few ways to get the data into a tabular format and manipulate it. SAM format is essentially a tabular format once the header is removed, so any of the tools that work directly with tabular input could be used (the Text Manipulation tool group includes most of them, but also see Datamash; Filter and Sort; and Join, Subtract and Group).
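As a rough illustration of treating SAM as a table, the sketch below separates the header lines (those starting with "@") from the alignment records and sorts the records by the length of the SEQ field, which is the tenth column in SAM. The function name and sample records are hypothetical. The sorted result is no longer coordinate sorted, so it should not be given the bam datatype in Galaxy.

```python
def sort_sam_by_read_length(lines, descending=False):
    """Split SAM text into header and records; sort records by SEQ length."""
    header = [line for line in lines if line.startswith("@")]
    records = [line for line in lines if line.strip() and not line.startswith("@")]

    def seq_length(line):
        seq = line.rstrip("\n").split("\t")[9]  # column 10 is SEQ
        return 0 if seq == "*" else len(seq)    # "*" means the sequence is not stored

    return header, sorted(records, key=seq_length, reverse=descending)


# Hypothetical SAM lines, for illustration only.
lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t10\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t0\tchr1\t20\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t30\t60\t6M\t*\t0\t0\tACGTAC\tIIIIII",
]
header, records = sort_sam_by_read_length(lines)
for record in records:
    print(record.split("\t")[0], len(record.split("\t")[9]))
```

The header is returned unchanged, so its SO tag would still say coordinate; anyone writing the result back out would want to update or drop that tag.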

Thanks! Jen, Galaxy team
