Filtering Lastz Output To *Best* Match Only


Apologies in advance if this isn't the best place to ask the following. I've RTFM-d for lastz-1.02.00 but may have missed it: is there currently any way to filter the output from a lastz run so that _only_ the best match is returned? (I'm currently doing this externally, but that seems inefficient if there's an option to do so.) Thanks in advance, b


modified 8.7 years ago by Anton Nekrutenko ♦ 1.7k • written 8.7 years ago by Brant Faircloth • 30

Anton Nekrutenko (Penn State) wrote, 8.7 years ago:

Brant: Forwarding to Bob Harris, the creator of lastz... a.

Anton Nekrutenko
http://nekrut.bx.psu.edu
http://usegalaxy.org

written 8.7 years ago by Anton Nekrutenko

Thanks - apparently, I didn't RTFM enough to see the lastz mailing list info right at the top of the README ;). best, b

written 8.7 years ago by Brant Faircloth

Howdy, Brant,

I'm at a conference this week and my internet connectivity is spotty. Short answer is "no", lastz doesn't provide a best-only filter. I am not sure what your input files are (one-to-one, many-to-many, one-to-many?). For one-to-one you could use sort and head to solve this (at the command line; I'm not sure specifically how you would do this in Galaxy). For many-to-one, such as mapping a large number of reads to a genome, it's a little more difficult because you want the sort to occur separately per read. It isn't an overly difficult thing to do, though.

... laptop battery going down. Gotta sign off.

Bob H
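For the one-to-one case, the sort-and-head idea amounts to keeping the single highest-scoring line. A minimal Python sketch of that, assuming the lastz output is tab-separated with an integer score in a known column and `#` header lines (the function name `best_alignment` and the column layout are illustrative assumptions, not part of lastz):

```python
def best_alignment(lines, score_col=0):
    """Return the fields of the highest-scoring line, or None if there are none.

    Assumes tab-separated lines with an integer score in column `score_col`;
    lines starting with '#' are treated as headers and skipped.
    """
    best_fields = None
    best_score = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        score = int(fields[score_col])
        # Keep the first line seen with the highest score.
        if best_score is None or score > best_score:
            best_fields, best_score = fields, score
    return best_fields
```

Because it reads one line at a time, this works on an open file handle as well as on a list of lines, and it never holds more than one alignment in memory.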

written 8.7 years ago by Bob Harris

Hi Bob,

No problem, thanks for the fast response. It's basically many-to-many right now, as long as my RAM holds and I don't go one-to-many. I've basically been parsing the lastz output to build a dictionary like so:

    {
        read1: {score1: match1, score2: match2},
        read2: {score1: match1},
    }

where score1 ... scoreN are integers, then:

    for read in matches:
        m = matches[read]
        bestMatch = m[max(m.keys())]
        # do stuff

This seems fast enough with a few tens of thousands of lines returned from lastz. I basically just wanted to see if I had missed anything, and maybe give the old +1 for --bestmatch as a potential filter parameter in the future ;). (Given enough time, I'll poke around in the source and see what I can do with my meager C skills.)

Thanks again for the extremely rapid response! Enjoy your conference...

best, b

written 8.7 years ago by Brant Faircloth

Howdy, Brant,

The method you describe is reasonable. You could potentially run into memory problems if you have too many reads, or if you have too many hits for a particular read (Python dictionaries, worst case, contain three times as many hash bins as keys). You could avoid this by keeping only a dictionary of read name to best hit.

Depending on your scoring scheme, highest score may or may not be the right choice for "best". For example, we have used the number of matches as the score for such comparisons. Also, depending on your application, you might want the "best" hit only in those cases where it is significantly better than the second-best hit.

Bob H
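The two suggestions above (one best hit per read, optionally only when it clearly beats the runner-up) might be sketched as follows. This is an illustration under assumptions rather than anything lastz provides: the function name `best_hits`, the `min_margin` parameter, and the tab-separated column layout are all hypothetical, and `score_col` can point at whichever integer column you treat as the "score" (alignment score, number of matches, and so on).

```python
def best_hits(lines, read_col=0, score_col=1, min_margin=0):
    """Map each read name to the fields of its best-scoring hit.

    Assumes tab-separated lines with the read name in `read_col` and an
    integer score in `score_col`; '#' header lines are skipped. A read is
    reported only if its best score exceeds the second-best score by at
    least `min_margin` (reads with a single hit are always reported).
    """
    # read name -> [best_score, best_fields, runner_up_score or None]
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        read, score = fields[read_col], int(fields[score_col])
        entry = best.get(read)
        if entry is None:
            best[read] = [score, fields, None]
        elif score > entry[0]:
            # New best; the old best becomes the runner-up.
            best[read] = [score, fields, entry[0]]
        elif entry[2] is None or score > entry[2]:
            entry[2] = score
    return {
        read: fields
        for read, (top, fields, runner_up) in best.items()
        if runner_up is None or top - runner_up >= min_margin
    }
```

Memory use grows with the number of distinct reads rather than the number of hits, since only the current best hit and the runner-up score are kept per read. With `min_margin=1`, reads whose top two hits tie are dropped.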

written 8.7 years ago by Bob Harris