Question: fasta_to_tabular.py slowness
9.4 years ago, Rasmus Ory Nielsen wrote:
Hi Galaxy Team,

I've found that fasta_to_tabular.py is very slow with big sequences, e.g. ~4 minutes for a single 5MB sequence. The patch below makes the running time go from minutes to seconds for such a sequence. Mind you, this is my first line of python, so there may be a smarter way.

Best regards,
Rasmus Ory Nielsen

+++ fasta_to_tabular.py 2009-07-18 17:22:49.544611000 +0200
@@ -34,7 +34,7 @@
             fasta_seq = ''
         else:
             if line:
-                fasta_seq = "%s%s" % ( fasta_seq, line )
+                fasta_seq += line
     if fasta_seq:
         out.write( "%s\t%s\n" %( fasta_title[ 1:keep_first ], fasta_seq ) )
9.4 years ago, Bob Harris wrote:
I suspect an additional improvement would be seen by keeping fasta_seq as a list of strings, using fasta_seq.append(line), and then concatenating them together with "".join when it's time to output. Mind you, I haven't tested that.

Bob H
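Bob's list-and-join suggestion can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Galaxy script: the function name `fasta_to_rows` is hypothetical, input is taken as any iterable of lines, and the `keep_first` title truncation from the patch is left out for brevity.

```python
def fasta_to_rows(lines):
    """Yield (title, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper: sequence lines are collected in a list and
    joined once per record instead of growing a string line by line.
    """
    title = None
    chunks = []  # sequence lines belonging to the current record
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if title is not None:
                # A new header closes the previous record.
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Sequence lines before the first header are ignored.
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


rows = list(fasta_to_rows([">seq1 demo", "ACGT", "TTGA", ">seq2", "GG"]))
for row_title, row_seq in rows:
    print("%s\t%s" % (row_title, row_seq))
```

The whole sequence still ends up in memory once per record, so this changes the cost of building the string, not the peak memory use.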
Think about memory when you have large files...
written 9.4 years ago by Greg Von Kuster
Hi all; Rasmus: [...] Bob: Greg:

The memory usage shouldn't be any different than the current implementation since an entire sequence is read into memory, and then written to the output file.

Bob's list/join approach is the standard way to quickly do this, although in Python 2.5 and above the concatenation approach is almost as good. The Python wiki has a good summary of this common speed-up improvement:

http://wiki.python.org/moin/PythonSpeed/PerformanceTips#StringConcatenation

Definitely worth adding. If memory is a problem the code could be improved to read in a specified number of lines and write them incrementally to the output file instead of breaking at sequence records.

Brad
written 9.4 years ago by Brad Chapman
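The incremental-write idea Brad mentions could look something like the sketch below. It is a simplified variant, not what the Galaxy script does: instead of batching a specified number of lines, it writes each sequence line as soon as it is read and leaves batching to the file object's own buffering. The function name `fasta_to_tabular_stream` and the in-memory `io.StringIO` files are assumptions made for the example.

```python
import io


def fasta_to_tabular_stream(infile, out):
    """Write title<TAB>sequence rows without holding a whole sequence.

    Hypothetical helper: each sequence line is written straight to
    `out`, so memory use does not grow with sequence length.
    """
    in_record = False
    for line in infile:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if in_record:
                out.write("\n")  # terminate the previous row
            out.write(line[1:] + "\t")
            in_record = True
        elif in_record:
            out.write(line)  # append to the current row; nothing is accumulated
    if in_record:
        out.write("\n")


# Demonstration with in-memory files standing in for real ones.
src = io.StringIO(">seq1\nACGT\nTTGA\n>seq2\nGG\n")
dst = io.StringIO()
fasta_to_tabular_stream(src, dst)
```

With real files, the same function would be called with two open file handles in place of `src` and `dst`.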
Hi Greg,

I was in the middle of writing a mail with a message very similar to what Brad Chapman just sent, so I will just send my time comparisons to back up my initial mail. At the moment, converting a few large sequences is not impossible, but it takes a lot of time.

Below are two tests I just ran. Both convert a single sequence, comparing the original and the patched (+= approach) versions of fasta_to_tabular.py.

Thanks.

Best regards,
Rasmus Ory Nielsen

[roni@galaxy]$ ls -lh test.fa
-rw-rw-r-- 1 roni roni 5.9M 2009-07-20 15:24 test.fa
[roni@galaxy]$ time ./fasta_to_tabular.py test.fa test.tab 0
real 0m0.214s
user 0m0.139s
sys 0m0.024s
[roni@galaxy]$ time ./fasta_to_tabular.py.orig test.fa test.tab.orig 0
real 2m37.114s
user 1m53.467s
sys 0m43.531s

And with a bigger file:

[roni@galaxy]$ ls -lh test2.fa
-rw-rw-r-- 1 roni roni 12M 2009-07-20 15:33 test2.fa
[roni@galaxy]$ time ./fasta_to_tabular.py test2.fa test2.tab 0
real 0m0.413s
user 0m0.264s
sys 0m0.050s
[roni@galaxy]$ time ./fasta_to_tabular.py.orig test2.fa test2.tab.orig 0
real 13m30.621s
user 9m18.316s
sys 4m12.081s

________________________________________
From: Greg Von Kuster [ghv2@psu.edu]
Sent: 20 July 2009 14:44
To: Bob Harris
Cc: galaxy-user@bx.psu.edu; Rasmus Ory Nielsen
Subject: Re: [galaxy-user] fasta_to_tabular.py slowness

Think about memory when you have large files...
written 9.4 years ago by Rasmus Ory Nielsen
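For anyone wanting to reproduce this kind of comparison without a large FASTA file, a small timing harness along these lines may help. It is only a sketch: the synthetic input, the function names, and the repeat count are all assumptions, and the numbers it prints depend on the Python version and machine, so they will not match the timings above.

```python
import timeit

# Synthetic input: 2000 sequence lines of 60 characters each (an assumption).
LINES = ["ACGT" * 15] * 2000


def concat_format(lines):
    """Original approach: rebuild the string with %-formatting each time."""
    seq = ""
    for line in lines:
        seq = "%s%s" % (seq, line)
    return seq


def concat_plus(lines):
    """Patched approach: in-place += concatenation."""
    seq = ""
    for line in lines:
        seq += line
    return seq


def concat_join(lines):
    """List/join approach: a single join over all lines."""
    return "".join(lines)


# Time five runs of each strategy on the same input.
timings = {
    fn.__name__: timeit.timeit(lambda fn=fn: fn(LINES), number=5)
    for fn in (concat_format, concat_plus, concat_join)
}
for name, seconds in timings.items():
    print("%-14s %.4f s" % (name, seconds))
```

Increasing the number of lines in `LINES` makes the difference between the strategies easier to see.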
written 9.4 years ago by James Casbon
James, Thanks very much for originally reporting this issue, and we really apologize for the lack of response until now. Messages like yours are extremely important to us, and we make our best attempt at responding to them and incorporating fixes in a timely manner. This is one of those times where we wish we had done things differently. I've opened the following issue in bitbucket and this fix is currently under way and will soon be available in the distribution and on our main server. Thanks James, Greg Von Kuster Galaxy Development Team
written 9.4 years ago by Greg Von Kuster
2009/7/22 Greg Von Kuster <ghv2@psu.edu>: Great! Thanks Greg, I didn't want the overhead of maintaining my own fork! Keep up the good work, James
written 9.4 years ago by James Casbon
James, Your contributed code has been used to replace the original versions of the fasta_filter_by_length.py and fasta_to_tabular.py files. These fixes have been pushed to the distribution as well. Thanks again for your contributions here, and please overlook our initial lack of response. I promise it won't happen again. Greg
written 9.4 years ago by Greg Von Kuster
Hello Rasmus, The fix for this should be pushed out to our public repo shortly, and available on our main site as well. I've opened the following ticket in bitbucket so you can "follow" it if you want.

http://bitbucket.org/galaxy/galaxy-central/issue/112/fix-fasta_to_tabularpy-issues-resulting-in-slow

Greg Von Kuster
Galaxy Development Team
written 9.4 years ago by Greg Von Kuster
Hi Greg,

This is great. Thanks.

Best regards,
Rasmus Ory Nielsen

________________________________________
From: Greg Von Kuster [ghv2@psu.edu]
Sent: 22 July 2009 20:28
To: Rasmus Ory Nielsen
Cc: galaxy-user@bx.psu.edu
Subject: Re: SV: [galaxy-user] fasta_to_tabular.py slowness

Hello Rasmus, The fix for this should be pushed out to our public repo shortly, and available on our main site as well. I've opened the following ticket in bitbucket so you can "follow" it if you want.

http://bitbucket.org/galaxy/galaxy-central/issue/112/fix-fasta_to_tabularpy-issues-resulting-in-slow

Greg Von Kuster
Galaxy Development Team
written 9.4 years ago by Rasmus Ory Nielsen
9.4 years ago, James Casbon wrote:
2009/7/18 Rasmus Ory Nielsen <rasmus.nielsen@agrsci.dk>: I sent similar patches through months ago - they got ignored by the core team, unfortunately. cheers, James