I'm using the Join tool in Galaxy. I have two files: file 1 contains a Locus ID and some data (~220,000 rows), and file 2 contains the Locus ID and an Annotation (~220,000 rows). I want to join them so I know which annotations go with my data. Some Locus IDs in the first file are not in the second. So, using the Join tool, I
- Join the two files on column 1, the Locus ID field.
- Set "Keep lines of first input that do not join with second input" to "Yes" and leave the other options at their default "No" values.
The resulting file has 5,400,000 rows because many of the rows are duplicated during the join. The first 10 rows of the two files become ~70 rows in the joined file: each Locus ID is joined as I would expect, but its row is repeated 3-20 times. However, if I take only the first 10 rows of each file, save them as two new input files, and repeat the same steps, I get only 10 rows of output. I'm not sure why the same data would give me two different outputs. Any insight?
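
One check that might narrow this down is whether some Locus IDs occur more than once in either full file. The sketch below assumes both files are tab-separated with the Locus ID in the first column, and it assumes the join pairs every matching row from file 1 with every matching row from file 2 (typical join behaviour, but I have not confirmed that Galaxy's Join tool works this way). The function names and file names are made up for illustration.

```python
from collections import Counter


def key_counts(path, key_col=0, delimiter="\t"):
    """Count how often each key appears in one column of a delimited text file."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line:  # skip blank lines
                counts[line.split(delimiter)[key_col]] += 1
    return counts


def expected_join_rows(left, right, keep_unmatched_left=True):
    """Predict the output row count, ASSUMING every left row with a given key
    is paired with every right row that has the same key."""
    total = 0
    for key, n_left in left.items():
        n_right = right.get(key, 0)
        if n_right:
            total += n_left * n_right  # each left row pairs with each right row
        elif keep_unmatched_left:
            total += n_left  # unmatched left rows are kept once each
    return total


if __name__ == "__main__":
    # Hypothetical file names; replace with the real inputs.
    file1 = key_counts("file1.tabular")
    file2 = key_counts("file2.tabular")
    repeated1 = sum(1 for n in file1.values() if n > 1)
    repeated2 = sum(1 for n in file2.values() if n > 1)
    print("Locus IDs repeated in file 1:", repeated1)
    print("Locus IDs repeated in file 2:", repeated2)
    print("Predicted joined rows:", expected_join_rows(file1, file2))
```

If the predicted count comes out near 5,400,000, repeated Locus IDs further down the full files would account for both the inflated output and the clean 10-row result from the small test files.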