Filter multi-hit alignments
Built-in tools are targeted at removing duplicates, not at filtering for multi-hit alignments specifically. But there is another method.
1. Split the SAM file into a headerless dataset of alignment lines and a separate dataset that holds the header. The Select tool can be used for both. Run it twice, once with Matching and once with Not Matching, and the regular expression:
2. Use the Group tool to count the number of times the query sequence (c1) is present in the output.
3. Use Filter to retrieve query sequences with a count greater than "1".
4. Use Join two Datasets side by side on a specified field to join the output of step 3 with the output of step 1 that contains the alignment lines. The join column for each input is the field containing the query sequence name.
5. (optional) Place the SAM header dataset from step 1 at the top of the output from step 4 using the tool "Concatenate". Adjust metadata (datatype, database) using the pencil icon > Edit Attributes.
6. (optional) Once a procedure is created to your liking, extract a workflow for re-use to avoid the need to execute the tools one by one in the future. https://wiki.galaxyproject.org/Learn/AdvancedWorkflow
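For anyone who wants to check the logic outside Galaxy, the procedure above can be sketched in a few lines of Python. This is only an illustration, not what the Galaxy tools run internally. The function names are made up for this example, and it relies on the SAM convention that header lines start with "@" and that the query name is the first tab-separated column.

```python
from collections import Counter


def split_sam(lines):
    """Separate SAM header lines (starting with '@') from alignment lines."""
    header, alignments = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("@"):
            header.append(line)
        else:
            alignments.append(line)
    return header, alignments


def multi_hit_alignments(lines):
    """Return header lines followed by alignments whose query name occurs more than once."""
    header, alignments = split_sam(lines)

    # Equivalent of the Group step: count occurrences of column 1 (query name).
    counts = Counter(a.split("\t", 1)[0] for a in alignments)

    # Equivalent of Filter + Join: keep alignment lines whose query name has count > 1.
    kept = [a for a in alignments if counts[a.split("\t", 1)[0]] > 1]

    # Equivalent of Concatenate: header first, then the retained alignments.
    return header + kept
```

To try it on a file, something like `multi_hit_alignments(open("input.sam"))` should work, where `input.sam` is a placeholder name. Keep in mind that with paired-end data both mates share a query name, so a count greater than 1 is not on its own proof of multiple hits.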
Compare Alignments from different mapping tools
Tools in the groups Operate on Genomic Intervals, SAMTools, Picard, and BAMTools are good choices to generate an overview of alignments in common, alignments overlapping in various ways, and overall mapping statistics. Check each tool form for the required input formats. SAM files may need to be submitted with headers removed, or sorted BAM input may be required (use Sort BAM dataset or SortSam even if the dataset name states the output is already sorted).
Visualize the different alignment outputs for a few known genes (especially those with novel isoforms detected) plus other regions outside of known annotation regions, then make a judgment call. Include BAM/SAM datasets along with an annotation track or two (or more!) in GTF or BED format in the visualization. Try Trackster within Galaxy to start with to see if it meets your needs. It is found under the graph icon per dataset, or under the masthead link Visualizations. This viewer works with all genomes, including custom reference genomes that have been indexed as a Custom Build: https://wiki.galaxyproject.org/Support#Custom_reference_genome
External viewer options may also be available depending on the database assignment and will be listed with links on the BAM/SAM dataset's expanded view (click on the dataset name to expand). Note that some external visualization tools require BAM format and nearly all require SAM format with intact headers.
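As a rough idea of what "alignments in common" means when comparing two mapping tools, here is a minimal Python sketch that compares the sets of mapped query names from two SAM outputs. It is not a replacement for the Galaxy tools listed above. The function names are invented for this example, and the only format detail it depends on is the SAM FLAG bit 0x4, which marks an unmapped read.

```python
def mapped_read_names(lines):
    """Collect query names of mapped reads from an iterable of SAM lines."""
    names = set()
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not a usable alignment line
        if int(fields[1]) & 0x4:
            continue  # FLAG bit 0x4 set: read is unmapped
        names.add(fields[0])
    return names


def compare_mappers(lines_a, lines_b):
    """Count query names mapped by both tools, or by only one of them."""
    a = mapped_read_names(lines_a)
    b = mapped_read_names(lines_b)
    return {"both": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}
```

This only compares which reads were mapped, not where they were placed, so it is a first-pass summary at best.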
Thanks, Jen, Galaxy team