I used the Megablast function (in NGS: Mapping\ROCHE-454\) to
analyze my FASTA sequences against the nt database and it worked fine for
me. However, it generated 56,804 hits although my query has only 1000
sequences. Is there any way to limit the number of
reported alignments to just one best hit per sequence? (In local
BLAST there are parameters such as -K1 -v 1 -b 1 to do so, but I can't
find similar options in Galaxy.)
When running Megablast, filtering by identity or e-value can help reduce
the number of hits (the default values are all fairly permissive for a
query run against a target genome of the same species, where the query
has been filtered for base-calling quality). Filtering out low-complexity
sequence would also be a big help, as a guess, considering the number of hits
generated from your initial data.
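If it helps, here is a rough sketch of that kind of identity/e-value filter applied to the hits after the run. It assumes the hits are in the standard 12-column BLAST tabular layout (percent identity in column 3, e-value in column 11); the Galaxy Megablast output may order its columns differently, so the column positions would need adjusting. The function name, the thresholds, and the sample rows are made up for illustration.

```python
import csv
import io

# Assumed 0-based column positions in standard 12-column BLAST tabular output.
# These are an assumption; check them against your actual Galaxy dataset.
PIDENT_COL = 2   # percent identity
EVALUE_COL = 10  # e-value


def filter_hits(rows, min_identity=95.0, max_evalue=1e-10):
    """Yield only the rows that pass both the identity and e-value thresholds."""
    for row in rows:
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        identity = float(row[PIDENT_COL])
        evalue = float(row[EVALUE_COL])
        if identity >= min_identity and evalue <= max_evalue:
            yield row


# Invented sample hits, tab-separated, for illustration only.
sample = (
    "read1\tgi|1\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "read1\tgi|2\t88.0\t60\t7\t0\t1\t60\t5\t64\t1e-05\t60\n"
    "read2\tgi|3\t97.5\t90\t2\t0\t1\t90\t10\t99\t1e-40\t150\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
kept = list(filter_hits(reader))

for row in kept:
    print("\t".join(row))
```

The thresholds here are placeholders; what counts as strict enough depends on how closely related the query and target are.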
There is also the "Parse blast XML output" tool. Converting the data to
interval format would allow the use of "Operate on Genomic Intervals
-> Cluster the intervals of a dataset". This is based on coverage, if
that is one of your criteria (it could be, if the threshold for identity is
a range you consider to be candidate choices for "best"). Identity and
coverage are commonly combined to identify "best", but this is just a
suggestion. The same type of logic could be used with top scoring
matches combined with coverage (would likely be similar as using
alone, if the identity is set to be high).
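As one possible reading of "top scoring combined with coverage", the sketch below keeps a single hit per query: the one with the highest bit score, with ties broken by how much of the query the alignment spans. This is only one definition of "best", not a Galaxy feature. It again assumes the standard 12-column BLAST tabular layout, and it uses the aligned query span as a stand-in for coverage because the query length is not in that format. The function name and sample rows are hypothetical.

```python
# Assumed 0-based column positions in standard 12-column BLAST tabular output.
QUERY_COL = 0
QSTART_COL = 6
QEND_COL = 7
BITSCORE_COL = 11


def best_hit_per_query(rows):
    """Return one row per query: highest bit score, then longest query span."""
    best = {}
    for row in rows:
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        query = row[QUERY_COL]
        span = abs(int(row[QEND_COL]) - int(row[QSTART_COL])) + 1
        key = (float(row[BITSCORE_COL]), span)
        # Keep the first hit seen for a query unless a later one ranks higher.
        if query not in best or key > best[query][0]:
            best[query] = (key, row)
    # Dicts preserve insertion order, so queries come out in first-seen order.
    return [row for _key, row in best.values()]


# Invented sample hits (already split into fields), for illustration only.
hits = [
    "read1\tgi|2\t96.0\t60\t2\t0\t1\t60\t5\t64\t1e-20\t100".split("\t"),
    "read1\tgi|1\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180".split("\t"),
    "read2\tgi|3\t97.5\t90\t2\t0\t1\t90\t10\t99\t1e-40\t150".split("\t"),
    "read2\tgi|4\t97.5\t80\t2\t0\t1\t80\t10\t89\t1e-40\t150".split("\t"),
]

best_rows = best_hit_per_query(hits)

for row in best_rows:
    print("\t".join(row))
```

Running an identity filter first and then picking the best remaining hit per query would combine the two criteria mentioned above.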
The idea to add a filter for "single best" is a good one, but has some
complexity associated with it. I will pass it along to the team as an
enhancement request to consider.
Hopefully this helps!