Question: Galaxy 101 Tutorial "Join" generating too much data
Bob wrote (2.2 years ago):

I've been trying to follow the Galaxy 101 tutorial (see related topic for background). When I execute the "join", the job seems to run for an excessively long time and generates hundreds of gigabytes of data.

The latest job was killed before completion, but at that time the files/sizes were:

  • dataset_17.dat ~933KB
  • dataset_18.dat ~17MB
  • dataset_19.dat ~50GB

Here's the "Job Command-Line" matching those file names:

python /scratch/Galaxy/galaxy/tools/filters/join.py /scratch/Galaxy/galaxy/database/files/000/dataset_17.dat /scratch/Galaxy/galaxy/database/files/000/dataset_18.dat 1 1 /scratch/Galaxy/galaxy/database/files/000/dataset_19.dat --index_depth=3 --buffer=50000000 --fill_options_file=/scratch/Galaxy/galaxy/database/jobs_directory/000/19/tmpYv55_l

Any ideas on what I might be doing wrong?


I've had a user run into an issue like this when there were duplicate entries in the column being joined on. In that case, there was a column of gene names (which aren't actually unique) and a column of counts. Since the gene names weren't unique, files A and B would each have two copies of a name, the merged file (C) would have four, and every successive merge after that would double the number of merged lines again. So double-check whether what you're joining on is actually unique.
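To make the row multiplication concrete, here is a minimal Python sketch. It is not Galaxy's join.py; the function names, the sample gene names, and the counts are all hypothetical, and columns are 0-based here, whereas the Galaxy command line in the question uses 1-based columns. It shows that a key appearing m times in one file and n times in the other yields m * n joined rows, and how to list the duplicated keys before running a join.

```python
from collections import Counter


def duplicate_keys(keys):
    """Return the keys that occur more than once, sorted."""
    counts = Counter(keys)
    return sorted(k for k, n in counts.items() if n > 1)


def predicted_join_rows(keys_a, keys_b):
    """Rows an inner join would produce: sum over keys of count_a * count_b."""
    count_a = Counter(keys_a)
    count_b = Counter(keys_b)
    return sum(n * count_b[k] for k, n in count_a.items())


def inner_join(rows_a, rows_b, col_a=0, col_b=0):
    """Naive inner join of two lists of tuples on the given (0-based) columns."""
    index = {}
    for row in rows_b:
        index.setdefault(row[col_b], []).append(row)
    return [a + b for a in rows_a for b in index.get(a[col_a], [])]


# Hypothetical data: "geneX" appears twice in each file, "geneY" once.
file_a = [("geneX", 10), ("geneX", 12), ("geneY", 7)]
file_b = [("geneX", 3), ("geneX", 5), ("geneY", 1)]

keys_a = [row[0] for row in file_a]
keys_b = [row[0] for row in file_b]

dupes = duplicate_keys(keys_a)                  # ["geneX"]
expected = predicted_join_rows(keys_a, keys_b)  # 2*2 + 1*1 = 5
merged = inner_join(file_a, file_b)             # 5 rows, 4 of them for geneX
```

Running the same count on the real join columns before joining gives a rough idea of whether the output will stay close to the input size or blow up.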

written 2.2 years ago by Devon Ryan

Thanks Devon.

I don't have enough experience yet to know if that's the problem or not. I'm trying to run the first "Galaxy 101" example, and I just followed the steps in that tutorial. I've tried it a few times (which is why my files are now numbered 17, 18, and 19), and I've been more careful each time with the same result.

Do the input file sizes (17 and 18) look right to start with?

modified 2.2 years ago • written 2.2 years ago by Bob

Is this the tutorial you are running? https://github.com/nekrut/galaxy/wiki/Galaxy101-1

written 2.2 years ago by Jennifer Hillman Jackson

Yes. It's our old friend from the other topic. I started a new topic because the instructions seem to indicate that "Answers should ONLY be used to respond to the original question at the top of this page!", and I wasn't sure if that meant that new posts should be a new topic.

written 2.2 years ago by Bob

Jennifer Hillman Jackson wrote (2.2 years ago):

Hello,

From reading the other post (https://biostar.usegalaxy.org/p/19854), it seems that the wrong Join tool was used. Please see my reply there for this and other details about how what you are doing deviates from the current tutorial.

Thanks, Jen, Galaxy team
