Dear all,
Is there a way, after running Megablast, to keep only the best match for each read? I ran Megablast and asked the tool to report identities above 97%. As a result, each read has a large number of alignments: a list of every sequence deposited in GenBank with 97 to 100% identity to the read.

For example, for a read that corresponds to Homo sapiens, I get about 2000 matches to deposited Homo sapiens sequences at 100% identity, then another 2000 to deposited Gorilla gorilla sequences at 99% identity, then another 1000 to deposited Chimpanzee sequences at 98% identity. Here I would like to keep only the best match, Homo sapiens, and only once, so that I can summarize the data and compare abundances with other vertebrates from other reads. If I cannot filter, the summary will report 2000 Homo sapiens, 2000 Gorilla gorilla, and 1000 Chimpanzee, when I actually have only 1 Homo sapiens (or several, if I can take different haplotypes into account, but I am not sure I will be able to do this).

For other reads, two different species of the same genus match equally well, so in those cases I would like to keep a single match assigned to the genus. In other cases, the best match has 98% identity. I would like to find a way to clean my whole Megablast output file automatically so that all of this is handled.
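To make the goal concrete, here is a minimal Python sketch of the filtering described above. It is only an illustration under several assumptions: the Megablast output is in the 12-column tabular format (read ID in column 1, subject ID in column 2, bit score in column 12), "best" means the highest bit score, and a lookup table from subject ID to a "Genus species" name is available. The function names and the `species_of` table are hypothetical, not part of any existing tool.

```python
from collections import Counter


def best_hits(rows):
    """Return {read_id: set of subject IDs tied for the highest bit score}.

    Assumes each row is a list of the 12 standard tabular columns.
    """
    best = {}
    for row in rows:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        read_id, subject_id, bitscore = row[0], row[1], float(row[11])
        score, subjects = best.get(read_id, (float("-inf"), set()))
        if bitscore > score:
            best[read_id] = (bitscore, {subject_id})  # new best hit
        elif bitscore == score:
            subjects.add(subject_id)  # tie with the current best hit
    return {read_id: subjects for read_id, (_, subjects) in best.items()}


def assign_taxon(subjects, species_of):
    """Collapse the tied best hits of one read to a single label.

    species_of is a hypothetical dict: subject ID -> "Genus species".
    """
    names = {species_of[s] for s in subjects if s in species_of}
    if not names:
        return "unassigned"
    if len(names) == 1:
        return names.pop()  # all best hits agree on the species
    genera = {name.split()[0] for name in names}
    if len(genera) == 1:
        return genera.pop()  # species differ, genus agrees
    return "ambiguous"


def summarise(rows, species_of):
    """Count reads per assigned taxon, one count per read."""
    hits = best_hits(rows)
    return Counter(assign_taxon(s, species_of) for s in hits.values())
```

With the file read through `csv.reader(handle, delimiter="\t")` and passed in as `rows`, the Homo sapiens read from my example would be counted once, and a read whose best hits are split between two species of one genus would be counted once under the genus name.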
Many thanks for your help