Keep only the best match after Megablast


Dear all,

Is there a way, after Megablast, to keep only the best match for each read? In my case, I ran Megablast and asked the tool to report identities above 97%. So, for each read, I got a large number of alignments, listing all sequences deposited in GenBank with 97 to 100% identity with the read.

For example, in some cases I have a read corresponding to Homo sapiens, and I get about 2000 matches to deposited Homo sapiens sequences with 100% identity, then another 2000 matches to deposited Gorilla gorilla sequences with 99% identity, then another 1000 matches to deposited chimpanzee sequences with 98% identity. Here I would like to keep only my best match, Homo sapiens, and only one copy of it, so that I can summarize the data and compare abundances with other vertebrates from other reads. If I cannot filter, my summary will show 2000 Homo sapiens, 2000 Gorilla gorilla, and 1000 chimpanzee, when I actually have only 1 Homo sapiens (or several Homo sapiens if I can take different haplotypes into account, but I am not sure I will be able to do this).

For other reads, two different species of the same genus match equally well, so in these cases I would like to keep only one match, assigned to the genus. In other cases, the best match has only 98% identity. I would like to find a way to clean my whole Megablast file automatically so that it handles all of these cases.

Many thanks for your help

modified 2.6 years ago by Jennifer Hillman Jackson ♦ 25k • written 2.6 years ago by etiennewalex • 30

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Using the Filter or Select tools to choose the hits with the highest percent identity/coverage will work in many cases, but not all. This is a universal issue faced when attempting to derive a top hit, in Galaxy or otherwise.

The best hit may in some cases be difficult to isolate, for example when there is a statistical tie. For these, you will need to pick the "best" based on other criteria (this may by necessity be an arbitrary choice). There is no automatic way to resolve such cases, but Galaxy can use imported or manipulated text files to perform further filtering on the output (Compare Two Datasets, Join Two Datasets). Other Text Manipulation tools are often helpful when filtering data.
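As an illustration of picking the "best" by other criteria when hits tie, here is a minimal Python sketch, run outside Galaxy, of one possible rule: if a single species holds the top percent identity, report the species; if several tied species share a genus, report the genus; otherwise mark the read as ambiguous. The function name `label_reads`, the `(read ID, percent identity, species name)` input layout, and the assumption that species names are binomials whose first word is the genus are all hypothetical and are not part of any Galaxy tool.

```python
def label_reads(hits):
    """Assign one label per read from (read_id, percent_identity, species) tuples.

    Assumption: species names look like "Genus species", so the first word
    is taken as the genus. Ties are detected by exact equality of the
    reported percent identity.
    """
    top_identity = {}  # read_id -> highest identity seen so far
    top_species = {}   # read_id -> set of species holding that identity
    for read_id, identity, species in hits:
        if read_id not in top_identity or identity > top_identity[read_id]:
            top_identity[read_id] = identity
            top_species[read_id] = {species}
        elif identity == top_identity[read_id]:
            top_species[read_id].add(species)

    labels = {}
    for read_id, species_set in top_species.items():
        if len(species_set) == 1:
            # A single species holds the top score: keep it.
            labels[read_id] = next(iter(species_set))
        else:
            # Several species tie: fall back to the genus if they share one.
            genera = {name.split()[0] for name in species_set}
            labels[read_id] = genera.pop() if len(genera) == 1 else "ambiguous"
    return labels


# Hypothetical hits, for illustration only.
hits = [
    ("read1", 100.0, "Homo sapiens"),
    ("read1", 100.0, "Homo sapiens"),
    ("read1", 99.0, "Gorilla gorilla"),
    ("read2", 98.0, "Canis lupus"),
    ("read2", 98.0, "Canis latrans"),
    ("read3", 97.5, "Bos taurus"),
    ("read3", 97.5, "Ovis aries"),
]
labels = label_reads(hits)
for read_id, label in labels.items():
    print(read_id, label, sep="\t")
```

Whether exact ties in percent identity are the right criterion is itself a judgment call; alignment length or bit score could be used instead, and the "ambiguous" reads would still need a manual or arbitrary decision.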

Once you have a method worked out, save it into a workflow for reuse.

Thanks, Jen, Galaxy team

written 2.6 years ago by Jennifer Hillman Jackson ♦ 25k

Thank you Jen,

Actually, I thought that it was possible, with the Select tool, to write something like: for identical names in column 1 (the column with the sequence ID), keep only the line with the maximum value in column 2 (the column with % identity). But I don't know what the syntax for this would be. Is this feasible with the Select or Filter tools? I've seen other programs that keep only one hit per sequence, but I am interested in doing it in Galaxy.

Thanks for your help

written 2.6 years ago by etiennewalex • 30
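The rule described in the reply above (for identical sequence IDs, keep only the line with the highest percent identity) can be sketched in a few lines of Python applied to a downloaded tab-separated file. This is only a sketch under stated assumptions, not a Galaxy Select or Filter expression: the function name `best_hit_per_read` is hypothetical, the column positions follow the layout given in the reply (sequence ID in column 1, percent identity in column 2) and should be adjusted to the actual Megablast output, and ties are resolved arbitrarily by keeping the first line encountered.

```python
import csv
import io


def best_hit_per_read(rows, id_col=0, identity_col=1):
    """Return one row per read ID: the row with the highest percent identity.

    Column indices are 0-based. When several rows tie for the highest
    identity, the first one encountered is kept (an arbitrary choice).
    """
    best = {}
    for row in rows:
        read_id = row[id_col]
        identity = float(row[identity_col])
        if read_id not in best or identity > float(best[read_id][identity_col]):
            best[read_id] = row
    # Dictionaries preserve insertion order, so reads come out in the
    # order in which they first appeared in the input.
    return list(best.values())


# Hypothetical tab-separated hits: read ID, percent identity, subject name.
example = (
    "read1\t100.0\tHomo sapiens\n"
    "read1\t99.0\tGorilla gorilla\n"
    "read1\t98.0\tPan troglodytes\n"
    "read2\t98.0\tBos taurus\n"
)
rows = list(csv.reader(io.StringIO(example), delimiter="\t"))
top_hits = best_hit_per_read(rows)
for row in top_hits:
    print("\t".join(row))
```

With one line per read, counting species across reads then reflects the number of reads rather than the number of database sequences that matched.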
