Galaxy 101 Tutorial "Join" generating too much data


2.2 years ago by

Bob • 0

United States

Bob • 0 wrote:

I've been trying to follow the Galaxy 101 tutorial (see related topic for background). When I execute the "join" step, the job runs for an excessively long time and generates hundreds of gigabytes of data.

The latest job was killed before completion, but at that time the files/sizes were:

dataset_17.dat ~933KB
dataset_18.dat ~17MB
dataset_19.dat ~50GB

Here's the "Job Command-Line" matching those file names:

python /scratch/Galaxy/galaxy/tools/filters/join.py /scratch/Galaxy/galaxy/database/files/000/dataset_17.dat /scratch/Galaxy/galaxy/database/files/000/dataset_18.dat 1 1 /scratch/Galaxy/galaxy/database/files/000/dataset_19.dat --index_depth=3 --buffer=50000000 --fill_options_file=/scratch/Galaxy/galaxy/database/jobs_directory/000/19/tmpYv55_l

Any ideas on what I might be doing wrong?



modified 2.2 years ago by Jennifer Hillman Jackson ♦ 25k • written 2.2 years ago by Bob • 0

I've had a user run into an issue like this when there were duplicate entries in the column being joined on. In that case, there was a column of gene names (which aren't actually unique) and a column of counts. Because a gene name appeared twice in both file A and file B, the merged file (C) had four lines for that name, and every successive merge after that doubled the number of merged lines again. So double-check that the column you're joining on is actually unique.
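
To make the duplicate-key effect concrete, here is a minimal Python sketch that predicts how many lines a join would produce. It assumes tab-separated input and that the join emits one output line for every matching pair of lines (the behaviour described above). The function names and the sample data are made up for illustration and are not part of Galaxy.

```python
from collections import Counter


def key_counts(lines, column):
    """Count how often each value appears in a 1-based, tab-separated column."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column:
            counts[fields[column - 1]] += 1
    return counts


def predicted_join_rows(lines_a, col_a, lines_b, col_b):
    """Predict the joined line count: for each shared key, count_a * count_b."""
    a = key_counts(lines_a, col_a)
    b = key_counts(lines_b, col_b)
    return sum(n * b[key] for key, n in a.items() if key in b)


# Hypothetical data: geneX appears twice in each file, geneY once in each.
file_a = ["geneX\t5\n", "geneX\t7\n", "geneY\t1\n"]
file_b = ["geneX\t2\n", "geneX\t9\n", "geneY\t4\n"]

# geneX contributes 2 * 2 = 4 lines, geneY contributes 1 * 1 = 1 line.
rows = predicted_join_rows(file_a, 1, file_b, 1)
print(rows)
```

To try it on real inputs, pass open file handles instead of lists, for example `open("dataset_17.dat")` and `open("dataset_18.dat")` with the join columns from the command line (1 and 1). If the predicted count is far larger than either input, the join keys are not unique.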

ADD REPLY • link written 2.2 years ago by Devon Ryan • 1.9k

Thanks Devon.

I don't have enough experience yet to know if that's the problem or not. I'm trying to run the first "Galaxy 101" example, and I just followed the steps in that tutorial. I've tried it a few times (which is why my files are now numbered 17, 18, and 19), and I've been more careful each time with the same result.

Do the input file sizes (17 and 18) look right to start with?

ADD REPLY • link modified 2.2 years ago • written 2.2 years ago by Bob • 0

Is this the tutorial you are running? https://github.com/nekrut/galaxy/wiki/Galaxy101-1

ADD REPLY • link written 2.2 years ago by Jennifer Hillman Jackson ♦ 25k

Yes. It's our old friend from the other topic. I started a new topic because the instructions seem to indicate that "Answers should ONLY be used to respond to the original question at the top of this page!", and I wasn't sure if that meant that new posts should be a new topic.

ADD REPLY • link written 2.2 years ago by Bob • 0
