Is there a possibility to speed up tophat2?
My TopHat2 job runs on 8 cores, but after a while segment_juncs runs on a single core and takes more than 24 hours. Is there anything I can change in my settings to speed up this step?
Hello,
Using a reference annotation file would almost certainly help (if available), as would using less sensitive parameters. But you would have to decide if this still produces the desired results.
For sensitivity, start by checking whether you are using realignment, and how your settings for edit distance, anchor length, and mismatches in the anchor deviate from the defaults. Much depends on the target reference genome and the input reads, so if these settings were modified, go back to the defaults to benchmark and proceed toward more sensitive runs from there.
Keep in mind that the tools are tuned for mammalian genomes. If you are working with something different, publications or group discussions that focus on your target genome will provide the best feedback on modifications that balance discovery against performance.
The primary publication also has some metrics for run times under specific conditions for reference:
http://genomebiology.com/2013/14/4/R36
Others are welcome to add comments.
Best, Jen, Galaxy team
Thanks,
I am mapping against bosTau6 (cow), and I also have a GFF annotation file as reference.
I have also included the command I'm using. Maybe you can see a setting that could lead to a major speed-up. My FASTQ file contains 27690231 sequences, which is pretty big.
Here is the command I used:
tophat2 --num-threads 8 --read-mismatches 2 --read-edit-dist 2 --read-realign-edit-dist 1000 -a 25 -m 0 -i 70 -I 500000 -g 5 --min-segment-intron 50 --max-segment-intron 500000 --segment-mismatches 1 --segment-length 25 --library-type fr-firststrand --max-insertion-length 3 --max-deletion-length 3 -G /opt/galaxy/galaxy-dist/database/files/000/dataset_324.dat --coverage-search --min-coverage-intron 50 --max-coverage-intron 20000 --b2-sensitive /opt/galaxy/galaxy-dist/tool-data/genome/bosTau6/bowtie2_index/bosTau6 /opt/galaxy/galaxy-dist/database/files/000/dataset_29.dat
Best, Jochen
Hello,
The tutorial and the manual, which describes each parameter choice, are here to review:
http://ccb.jhu.edu/software/tophat/tutorial.shtml#toph
http://ccb.jhu.edu/software/tophat/manual.shtml#toph
I don't know how long your reads are, so I can't say whether coverage search is worth it: try a run with it and a run without it, and compare. I also don't know how sensitive you need the run to be. The Bowtie2 preset "sensitive" was chosen; you could try the "fast" or "very fast" preset and again compare to see whether any gain from the more sensitive setting is worth the extra compute time.
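As a rough sketch of what such a comparison run could look like, here is the command from this thread with only those two settings changed: `--no-coverage-search` in place of `--coverage-search` (the two `--*-coverage-intron` limits are dropped with it, on the assumption that they only matter when coverage search is on), and `--b2-fast` in place of `--b2-sensitive`. The paths are copied from the post above. The script only prints the command (a dry run), so nothing is executed until you decide to run it, and whether the faster settings are acceptable for your data is something only the comparison can tell you.

```shell
#!/usr/bin/env bash
# Dry-run sketch: build the TopHat2 command with two speed-oriented changes
# and print it. Paths come from the original post; adjust for your instance.
set -euo pipefail

INDEX=/opt/galaxy/galaxy-dist/tool-data/genome/bosTau6/bowtie2_index/bosTau6
ANNOTATION=/opt/galaxy/galaxy-dist/database/files/000/dataset_324.dat
READS=/opt/galaxy/galaxy-dist/database/files/000/dataset_29.dat

cmd=(
  tophat2
  --num-threads 8
  --read-mismatches 2 --read-edit-dist 2 --read-realign-edit-dist 1000
  -a 25 -m 0 -i 70 -I 500000 -g 5
  --min-segment-intron 50 --max-segment-intron 500000
  --segment-mismatches 1 --segment-length 25
  --library-type fr-firststrand
  --max-insertion-length 3 --max-deletion-length 3
  -G "$ANNOTATION"
  --no-coverage-search   # change 1: was --coverage-search plus its intron limits
  --b2-fast              # change 2: was --b2-sensitive
  "$INDEX" "$READS"
)

# Print the command for review. To actually run it, use: "${cmd[@]}"
printf '%s\n' "${cmd[*]}"
```

Keeping everything else identical means any difference in run time or in the junctions found can be attributed to these two settings.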
The rest I would most likely leave as-is, although you should also review the settings against the manual, make your choices, and test. The value of this type of exercise increases with the amount of data of the same type/genome you have to process.
If you need to map Galaxy form settings to the command-line settings, just examine the tophat2 XML tool wrapper. Each option is annotated and/or commented with both the command-line option identifier and the tool form label. You can do this either locally on your instance, or view the repository tip files in the Tool Shed: http://toolshed.g2.bx.psu.edu/
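One way to make this kind of testing cheap is to time the candidate settings on a subset of the reads before committing to the full file. The helper below is a minimal sketch that assumes an uncompressed FASTQ with exactly four lines per record; the function name and the file names in the comments are hypothetical. Note that taking the first reads of a file is not a random sample, so treat the timings as indicative only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# subsample_fastq INPUT N_READS OUTPUT
# Copy the first N_READS records of an uncompressed FASTQ file
# (assumes the usual 4 lines per record).
subsample_fastq() {
  local input=$1 n_reads=$2 output=$3
  head -n "$(( n_reads * 4 ))" "$input" > "$output"
}

# Example use (hypothetical file names, not executed here):
#   subsample_fastq reads.fastq 1000000 subsample.fastq
#   time tophat2 [settings A] -o run_a <bowtie2 index> subsample.fastq
#   time tophat2 [settings B] -o run_b <bowtie2 index> subsample.fastq
```

Comparing the junction output of the two runs alongside the wall-clock times shows what the more sensitive settings actually buy you on this genome.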
Good luck, Jen, Galaxy team