Question: Filter for sequencing reads mapped to multiple loci
willtun1991 (United Kingdom) wrote, 2.7 years ago:

Hello

I have done both a hisat2 and tophat alignment, both yielding BAM files which I later converted to SAM files.

1) I need to filter my alignment outputs so that I have a dataset containing only the reads that mapped to MULTIPLE locations (i.e. more than one location). How may I do this?

2) Have you any suggestions on how to compare the outputs of hisat2 and tophat to see which is the more accurate aligner?

Will

Jennifer Hillman Jackson (United States) wrote, 2.7 years ago:

Hello,

Filter multi-hit alignments

The built-in tools are targeted at removing duplicates, not at filtering for multi-hit alignments specifically. But there is another method.

  1. Split the SAM file into two datasets: one containing only the header lines and one containing only the alignment lines (headerless). The Select tool can create both. Run it twice, once with Matching and once with NOT Matching, using the regular expression:

^@

  2. Use the Group tool on the headerless dataset from step 1 to count the number of times each query sequence name (c1) is present.

  3. Use Filter to retrieve the query sequence names with a count greater than 1.

  4. Use Join two Datasets side by side on a specified field to join the output of step 3 with the headerless dataset from step 1 that contains the alignment lines. The join column for each input is the field containing the query sequence name.

  5. (optional) Place the SAM header dataset from step 1 at the top of the output from step 4 using the Concatenate tool. Adjust metadata (datatype, database) using the pencil icon > Edit Attributes.

  6. (optional) Once the procedure works to your liking, extract a workflow for re-use so that the tools do not need to be run one by one in the future. https://wiki.galaxyproject.org/Learn/AdvancedWorkflow
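For anyone who wants to check the logic outside Galaxy, the steps above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names and the tiny in-memory sample are made up for the example, and it assumes single-end reads (with paired-end data both mates share the same query name, so a count greater than 1 would not by itself mean multiple loci).

```python
from collections import Counter


def split_sam(lines):
    """Step 1: separate header lines (starting with '@') from alignment lines."""
    header = [ln for ln in lines if ln.startswith("@")]
    alignments = [ln for ln in lines if ln.strip() and not ln.startswith("@")]
    return header, alignments


def multi_hit_alignments(lines):
    """Keep the header plus every alignment whose query name occurs more than once."""
    header, alignments = split_sam(lines)
    # Step 2: count how often each query name (column 1) appears.
    counts = Counter(ln.split("\t", 1)[0] for ln in alignments)
    # Step 3: keep only names with a count greater than 1.
    multi = {name for name, n in counts.items() if n > 1}
    # Step 4: pull the full alignment lines for those names.
    kept = [ln for ln in alignments if ln.split("\t", 1)[0] in multi]
    # Step 5: put the header back on top.
    return header + kept


# Hypothetical sample: readA is reported at two positions, readB at one.
sample = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:1000",
    "readA\t0\tchr1\t100\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "readA\t256\tchr1\t500\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "readB\t0\tchr1\t200\t60\t4M\t*\t0\t0\tTTTT\tIIII",
]
result = multi_hit_alignments(sample)
print("\n".join(result))
```

With the sample above, the two header lines and both readA lines are kept, and readB is dropped.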

Compare Alignments from different mapping tools

  1. Tools in the groups Operate on Genomic Intervals, SAMTools, Picard, and BAMTools are good choices for generating an overview of alignments in common, alignments that overlap in various ways, and overall mapping statistics. Check each tool form for the required input format: SAM files may need to be submitted with headers removed, or sorted BAM input may be required (use Sort BAM dataset or SortSam, even if the dataset name states that the output is already sorted).

  2. Visualize the different alignment outputs for a few known genes (especially those with novel isoforms detected), plus other regions outside of known annotation, then make a judgment call. Include the BAM/SAM datasets along with an annotation track or two (or more!) in GTF or BED format in the visualization. Try Trackster within Galaxy first to see if it meets your needs. It is found under the graph icon per dataset, or under the masthead link Visualizations. This viewer works with all genomes, including custom reference genomes that have been indexed as a Custom Build (https://wiki.galaxyproject.org/Support#Custom_reference_genome). External viewer options may also be available depending on the database assignment and will be listed with links in the BAM/SAM dataset's expanded view (click on the dataset name to expand). Note that some external visualization tools require BAM format, and nearly all require SAM input to have intact headers.
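As a rough illustration of the "alignments in common" idea from point 1, here is a small Python sketch that compares where each read's primary alignment landed in two SAM outputs. The function names and the sample lines (hisat_sam, tophat_sam) are made up for the example, it assumes single-end reads, and agreement between two aligners is not the same as accuracy, so treat the counts as a starting point for the visual checks in point 2.

```python
def primary_positions(lines):
    """Map read name -> (reference, position) for primary, mapped alignments."""
    positions = {}
    for ln in lines:
        if not ln.strip() or ln.startswith("@"):
            continue  # skip blank lines and header lines
        fields = ln.split("\t")
        flag = int(fields[1])
        # SAM flag bits: 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        positions[fields[0]] = (fields[2], int(fields[3]))
    return positions


def compare(sam_a, sam_b):
    """Count reads placed by one tool only, by both at the same spot, or at different spots."""
    a, b = primary_positions(sam_a), primary_positions(sam_b)
    shared = a.keys() & b.keys()
    same = sum(1 for name in shared if a[name] == b[name])
    return {
        "only_a": len(a.keys() - b.keys()),
        "only_b": len(b.keys() - a.keys()),
        "same_position": same,
        "different_position": len(shared) - same,
    }


# Hypothetical alignment lines standing in for the two tools' outputs.
hisat_sam = [
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr2\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
tophat_sam = [
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t250\t50\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tchr1\t10\t50\t4M\t*\t0\t0\tACGT\tIIII",
]
summary = compare(hisat_sam, tophat_sam)
print(summary)
```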

Thanks, Jen, Galaxy team
