Question: Galaxy Join of two large datasets makes duplicate rows, but same rows not duplicated when smaller file subsets are joined
Asked 16 months ago by boglabmgr:

I'm using the Join tool in Galaxy. I have two files: file 1 contains a Locus ID and some data (~220,000 rows), and file 2 contains the Locus ID and an Annotation (~220,000 rows). I want to join them so that I know which annotations go with my data. Some Locus IDs in the first file are not in the second. So, using the Join tool, I

  • Join the two files on column 1, the Locus ID field.
  • Set "Keep lines of first input that do not join with second input" to Yes, and leave the other options at their default "No" values.

The resulting file has 5,400,000 rows because many, many of the rows are being duplicated during the join. The first 10 rows of these two files become ~70 rows in the resulting joined file because rows for the same Locus ID (though joined as I would expect) are repeated 3-20 times. However, if I take only the first 10 rows of each file, make two new input files, and then do the same steps, I get only 10 rows of output. I'm not sure why the same data would be giving me two different outputs. Any insight?

Answered 16 months ago by Jennifer Hillman Jackson (United States):

Hello,

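The join is likely finding more matching rows in the larger input datasets: if a Locus ID occurs more than once in either file, each occurrence is paired with every matching row in the other file. Duplicates are expected in many use cases.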
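To illustrate how a join can behave differently on a subset than on the full files, here is a minimal Python sketch. It is not the Galaxy Join tool's actual code: the `left_join` helper and the Locus IDs and annotations are made up for illustration, and the output is simplified so that the key column appears only once. It assumes the extra rows come from IDs that occur more than once in the annotation file.

```python
from collections import defaultdict

def left_join(left_rows, right_rows):
    """Join rows on their first column, keeping left rows that have no match."""
    # Index the right-hand rows by key; one key may map to several rows.
    index = defaultdict(list)
    for row in right_rows:
        index[row[0]].append(row)

    joined = []
    for row in left_rows:
        matches = index.get(row[0])
        if matches:
            # One output row per matching right-hand row.
            for match in matches:
                joined.append(row + match[1:])
        else:
            # Unmatched left rows are kept as-is.
            joined.append(row)
    return joined

# Hypothetical data: LOC1 is annotated three times, but two of those
# annotation rows sit further down the annotation file.
data = [("LOC1", "5.2"), ("LOC2", "1.1"), ("LOC3", "0.4")]
annotations = [
    ("LOC1", "kinase"),
    ("LOC2", "transporter"),
    ("LOC1", "kinase, putative"),
    ("LOC1", "protein kinase"),
]

full = left_join(data, annotations)              # whole files
subset = left_join(data[:2], annotations[:2])    # first two rows of each

print(len(full), len(subset))
```

With the whole files, the single LOC1 data row is emitted once per LOC1 annotation row. With only the first rows of each file, the later LOC1 annotations are never seen, so each data row appears once.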

Was this result obtained from running the tool at http://usegalaxy.org or can it be reproduced there? If so, a shared history link can be sent in for us to confirm that, once the server is back up tomorrow. This is how: https://galaxyproject.org/issues/#usage-problem-reporting

Thanks, Jen, Galaxy team


No, since it was down today I used http://galaxy.wur.nl/galaxy_production/ and then also http://galaxy.informatik.uni-halle.de/ just to make sure. I can try again tomorrow when it's back up, though.

So, how would I get the annotations to be joined with the data only once, rather than multiple annotation rows joining with the same data (which I think is what you are saying is happening)?

(Reply written 16 months ago by boglabmgr)
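One possible way to approach this, sketched in Python rather than with Galaxy tools: first check whether any Locus ID is repeated in the annotation file, then reduce that file to one row per ID before joining. The helper names and the sample rows are hypothetical, and keeping the first annotation per ID is only one of several reasonable choices (merging the annotation text would be another).

```python
from collections import Counter

def repeated_keys(rows):
    """Return {key: count} for keys that occur more than once in column 1."""
    counts = Counter(row[0] for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

def first_per_key(rows):
    """Keep only the first row seen for each key, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row[0] not in seen:
            seen.add(row[0])
            unique.append(row)
    return unique

# Hypothetical annotation rows: (Locus ID, Annotation)
annotations = [
    ("LOC1", "kinase"),
    ("LOC2", "transporter"),
    ("LOC1", "kinase, putative"),
    ("LOC1", "protein kinase"),
]

dupes = repeated_keys(annotations)
deduplicated = first_per_key(annotations)

print(dupes)
print(deduplicated)
```

If `repeated_keys` reports nothing for either file, repeated IDs are not the cause and the inputs would need a closer look (for example, for trailing whitespace or differing ID formats).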