Question: Filtering Lastz Output To *Best* Match Only
8.7 years ago by
Brant Faircloth wrote:
Apologies in advance if this isn't the best place to ask the following: I've RTFM-d for lastz-1.02.00 but may have missed it. Is there currently any way to filter the output of a lastz run so that _only_ the best match is returned? (I'm currently doing this externally, but that seems inefficient if there's an option to do so.)

thanks in advance,
b
8.7 years ago by
Penn State
Anton Nekrutenko wrote:
Brant: Forwarding to Bob Harris, the creator of lastz...

a.

Anton Nekrutenko
http://nekrut.bx.psu.edu
http://usegalaxy.org
Thanks. Apparently I didn't RTFM enough to see the lastz mailing list info right at the top of the README ; ).

best,
b
Reply written 8.7 years ago by Brant Faircloth
Howdy, Brant,

I'm at a conference this week and my internet connectivity is spotty. Short answer is "no": lastz doesn't provide a best-only filter.

I am not sure what your input files are (one-to-one, many-to-many, one-to-many?). For one-to-one you could use sort and head to solve this (at the command line; I'm not sure specifically how you would do this in Galaxy). For many-to-one, such as mapping a large number of reads to a genome, it's a little more difficult because you want the sort to occur separately per read. It isn't an overly difficult thing to do, though.

... laptop battery going down. Gotta sign off.

Bob H
Reply written 8.7 years ago by Bob Harris
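A minimal sketch of the per-read sort described above, in Python. It assumes the lastz output has already been reduced to tab-separated lines of score, read name, and target name; that column layout, and the names `parse_hits` and `best_per_read_sorted`, are assumptions made for illustration and are not lastz features.

```python
from itertools import groupby
from operator import itemgetter


def parse_hits(lines):
    """Yield (read, score, target) tuples.

    Assumes each line is 'score<TAB>read<TAB>target' (a hypothetical layout);
    blank lines and lines starting with '#' are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        score, read, target = line.split("\t")[:3]
        yield read, int(score), target


def best_per_read_sorted(lines):
    """Sort hits by read name, then by descending score, and keep the first hit per read."""
    hits = sorted(parse_hits(lines), key=lambda h: (h[0], -h[1]))
    return [next(group) for _, group in groupby(hits, key=itemgetter(0))]
```

This holds every hit in memory for the sort, so it suits the "few tens of thousands of lines" case better than a very large read set.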
hi bob,

no problem, thanks for the fast response. it's basically many-to-many right now, as long as my RAM holds and i don't go one-to-many. i've basically been parsing the lastz output to build a dictionary like so:

    {
        read1: {score1: match1, score2: match2},
        read2: {score1: match1},
    }

where score1…scoreN is an integer, then:

    for read in matches:
        m = matches[read]
        bestMatch = m[max(m.keys())]
        … do stuff …

this seems fast enough with a few tens of thousands of lines returned from lastz. i basically just wanted to see if i had missed anything and maybe give the old +1 for --bestmatch as a potential filter parameter in the future ; ). (given enough time, i'll poke around in the source and see what i can do with my meager C skills)

thanks again for the extremely rapid response! enjoy your conference...

best,
b
Reply written 8.7 years ago by Brant Faircloth
Howdy, Brant,

The method you describe is reasonable. You could potentially run into memory problems if you have too many reads, or if you have too many hits for a particular read (Python dictionaries, worst case, contain three times as many hash-bins as keys). You could avoid this by keeping only a dictionary of read-name to best-hit.

Depending on your scoring scheme, highest score may or may not be the right choice for "best". For example, we have used the number of matches as the score for such comparisons. Also, depending on your application, you might want the "best" hit only in those cases where it is significantly better than the second-best hit.

Bob H
Reply written 8.7 years ago by Bob Harris
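One way the read-name to best-hit dictionary could look, as a rough Python sketch. It takes hits as `(read, score, target)` tuples in any order and stores only the best hit and the runner-up score per read, so memory grows with the number of reads rather than the number of hits. The function name `best_hits` and the `min_margin` parameter are hypothetical; which quantity serves as "score" (alignment score, number of matches, or something else) and what margin counts as "significantly better" are left to the application.

```python
def best_hits(hits, min_margin=0):
    """Return {read: (score, target)} holding each read's best hit.

    hits       -- iterable of (read, score, target) tuples, in any order
    min_margin -- hypothetical threshold: a read with more than one hit is
                  kept only if its best score beats the second-best score
                  by at least this much
    """
    best = {}        # read -> (score, target) of the best hit seen so far
    runner_up = {}   # read -> second-best score seen so far

    for read, score, target in hits:
        if read not in best:
            best[read] = (score, target)
        elif score > best[read][0]:
            # New best; the previous best becomes the runner-up.
            runner_up[read] = best[read][0]
            best[read] = (score, target)
        elif score > runner_up.get(read, float("-inf")):
            # Not better than the best (ties land here), but better than the runner-up.
            runner_up[read] = score

    return {
        read: hit
        for read, hit in best.items()
        if read not in runner_up or hit[0] - runner_up[read] >= min_margin
    }
```

With the default `min_margin=0` every read keeps its top-scoring hit; a positive margin drops reads whose top two hits are too close to call, including exact ties.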