Galaxy Join of two large datasets makes duplicate rows, but same rows not duplicated when smaller file subsets are joined


16 months ago, boglabmgr • 10 wrote:

I'm using the Join tool in Galaxy. I have two files: file 1 contains a Locus ID and some data (~220,000 rows), and file 2 contains the Locus ID and an annotation (~220,000 rows). I want to join them so I know which annotations go with my data. Some Locus IDs in the first file are not in the second. So, using the Join tool, I:

Join the two files on column 1, the Locus ID field.
Set "Keep lines of first input that do not join with second input" to Yes, and leave the other options at their default "No" values.

The resulting file has 5,400,000 rows because many, many of the rows are being duplicated during the join. The first 10 rows of these two files become ~70 rows in the resulting joined file because rows for the same Locus ID (though joined as I would expect) are repeated 3-20 times. However, if I take only the first 10 rows of each file, make two new input files, and then do the same steps, I get only 10 rows of output. I'm not sure why the same data would be giving me two different outputs. Any insight?


modified 16 months ago by Jennifer Hillman Jackson ♦ 25k • written 16 months ago by boglabmgr • 10

16 months ago, Jennifer Hillman Jackson ♦ 25k (United States) wrote:

Hello,

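The join is likely finding more matching rows in the larger input dataset. If a Locus ID appears more than once in either file, each row with that ID is paired with every matching row from the other file, so duplicates are expected in many use-cases.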
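To illustrate how this can produce more rows in the full files than in a 10-row subset, here is a minimal Python sketch of generic join behavior. It is not Galaxy's implementation; the function name `left_join` and the sample Locus IDs and values are made up for the example. The point is that a repeated ID further down the annotation file only takes effect when the whole file is joined.

```python
from collections import defaultdict

def left_join(rows1, rows2, key=0):
    """Join rows on a key column, keeping rows of rows1 that have no match."""
    # Index the second input by key; one key may map to several rows.
    index = defaultdict(list)
    for row in rows2:
        index[row[key]].append(row)

    out = []
    for row in rows1:
        matches = index.get(row[key])
        if not matches:
            # Unmatched row from the first input is kept as-is.
            out.append(list(row))
            continue
        # One output row per matching row: this is where rows multiply.
        for match in matches:
            out.append(list(row) + list(match))
    return out

# Hypothetical sample data: LOC1 is annotated twice, the second time
# further down the annotation file.
data = [["LOC1", "5.2"], ["LOC2", "3.1"], ["LOC3", "7.7"]]
annotations = [
    ["LOC1", "kinase"],
    ["LOC2", "transporter"],
    ["LOC1", "kinase, putative"],
]

joined_all = left_join(data, annotations)           # full files
joined_head = left_join(data[:2], annotations[:2])  # first rows only

print(len(joined_all))   # LOC1 appears twice in the output
print(len(joined_head))  # the repeated LOC1 annotation was cut off
```

In the subset, the second LOC1 annotation is never seen, so each data row joins exactly once. That would be consistent with a 10-row test giving 10 rows while the full files give many more.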

Was this result obtained from running the tool at http://usegalaxy.org or can it be reproduced there? If so, a shared history link can be sent in for us to confirm that, once the server is back up tomorrow. This is how: https://galaxyproject.org/issues/#usage-problem-reporting

Thanks, Jen, Galaxy team

written 16 months ago by Jennifer Hillman Jackson ♦ 25k

No, since it was down today I used http://galaxy.wur.nl/galaxy_production/ and then also http://galaxy.informatik.uni-halle.de/ just to make sure. I can try again tomorrow when it's back up, though.

So, how would I get the annotations to be joined with the data only once, rather than multiple annotation rows joining with the same data (which I think is what you are saying is happening)?
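One possible approach, sketched in Python under the assumption that the repeats come from Locus IDs that occur more than once in the annotation file: reduce the annotation file to one row per Locus ID before joining. The helper names `first_per_key` and `merge_per_key` and the sample rows are hypothetical. Whether keeping the first annotation or merging all of them is appropriate depends on what the repeated annotations actually contain.

```python
def first_per_key(rows, key=0):
    """Keep only the first row seen for each key value."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

def merge_per_key(rows, key=0, value=1, sep="; "):
    """Collapse rows sharing a key into one row, joining distinct values."""
    merged = {}  # dicts keep insertion order in Python 3.7+
    for row in rows:
        values = merged.setdefault(row[key], [])
        if row[value] not in values:
            values.append(row[value])
    return [[k, sep.join(v)] for k, v in merged.items()]

# Hypothetical annotation rows with repeated Locus IDs.
annotations = [
    ["LOC1", "kinase"],
    ["LOC2", "transporter"],
    ["LOC1", "kinase, putative"],
    ["LOC2", "transporter"],
]

first_only = first_per_key(annotations)
merged = merge_per_key(annotations)

print(first_only)
print(merged)
```

After either step, each Locus ID appears once in the annotation file, so a join on column 1 can pair each data row with at most one annotation row.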

written 16 months ago by boglabmgr • 10
