I've been trying to follow the Galaxy 101 tutorial (see related topic for background). When I execute the "join" step, the job runs for an excessively long time and generates hundreds of gigabytes of data.
The latest job was killed before completion, but at that time the files/sizes were:
- dataset_17.dat ~933KB
- dataset_18.dat ~17MB
- dataset_19.dat ~50GB
Here's the "Job Command-Line" matching those file names:
python /scratch/Galaxy/galaxy/tools/filters/join.py /scratch/Galaxy/galaxy/database/files/000/dataset_17.dat /scratch/Galaxy/galaxy/database/files/000/dataset_18.dat 1 1 /scratch/Galaxy/galaxy/database/files/000/dataset_19.dat --index_depth=3 --buffer=50000000 --fill_options_file=/scratch/Galaxy/galaxy/database/jobs_directory/000/19/tmpYv55_l
Any ideas on what I might be doing wrong?
I've had a user run into an issue like this when there were duplicate entries in the column being joined on. In that case, there was a column of gene names (which aren't actually unique) and a column of counts. Since the gene names weren't unique, files A and B would each have two copies of a name, the merged file (C) would have four, and every successive merge thereafter would double the number of merged lines. So double-check whether what you're joining on is actually unique.
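If it helps, here is a rough Python sketch for checking this before running the join. It is not how Galaxy's join.py works internally; it only counts how often each value appears in the join column and estimates how many lines a plain inner join would produce. It assumes tab-delimited input and 1-based column numbers (like the "1 1" in the command line above). The function names are made up for illustration.

```python
from collections import Counter


def key_counts(lines, column):
    """Count occurrences of each value in a 1-based column of tab-delimited lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= column:  # skip lines too short to have the column
            counts[fields[column - 1]] += 1
    return counts


def duplicated_keys(counts):
    """Return only the keys that occur more than once."""
    return {key: n for key, n in counts.items() if n > 1}


def estimate_join_rows(counts_a, counts_b):
    """Estimate inner-join output: each shared key yields count_a * count_b lines."""
    return sum(n * counts_b[key] for key, n in counts_a.items() if key in counts_b)


# Tiny in-memory example: "geneA" appears twice on each side.
file_a = ["geneA\t10\n", "geneA\t12\n", "geneB\t7\n"]
file_b = ["geneA\tchr1\n", "geneA\tchr2\n", "geneC\tchr3\n"]

counts_a = key_counts(file_a, 1)
counts_b = key_counts(file_b, 1)
print(duplicated_keys(counts_a))               # keys repeated in the first input
print(estimate_join_rows(counts_a, counts_b))  # 2 * 2 = 4 lines for geneA
```

To try it on real inputs, open each file and pass the handle in, for example `with open("dataset_17.dat") as fh: counts_a = key_counts(fh, 1)`. If the estimate comes out far larger than either input's line count, duplicate keys are the likely cause of the blow-up.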
Thanks Devon.
I don't have enough experience yet to know if that's the problem or not. I'm trying to run the first "Galaxy 101" example, and I just followed the steps in that tutorial. I've tried it a few times (which is why my files are now numbered 17, 18, and 19), and I've been more careful each time, with the same result.
Do the input file sizes (17 and 18) look right to start with?
Is this the tutorial you are running? https://github.com/nekrut/galaxy/wiki/Galaxy101-1
Yes. It's our old friend from the other topic. I started a new topic because the instructions seem to indicate that "Answers should ONLY be used to respond to the original question at the top of this page!", and I wasn't sure if that meant that new posts should be a new topic.