Question: Comparing 2 Datasets producing files with larger size than originals. Why?
3.6 years ago by
justinpadinske0 wrote:

I have 17 genomic datasets. For each dataset, I combined the chromosome number and start position into a single column. Then I used "Compare two datasets" to compare each dataset to another, looking for commonalities. My goal is to create one file with a single column containing the entries that are found in all 17 datasets.

After each comparison, I took the newly created dataset, compared it to the next file, and continued this process. For the first few comparisons, the newly created file got smaller, which makes sense because entries not found in both datasets are removed.

However, the count drops below 1 million commonalities after about 4 comparisons and then starts going back up. How is it possible to compare 2 files for similarities and have the number get bigger?

galaxy • 689 views
3.6 years ago by
United States
Jennifer Hillman Jackson25k wrote:

Hello,

This would make sense if entries in one file overlap with one or more entries in the other. A "many-to-many" comparison can often produce a very large output. You could try the "Group" tool to check whether the merged column you are using for the comparison contains duplicates.
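As a rough illustration of the many-to-many effect, here is a small Python sketch. It is not Galaxy's implementation: it assumes the comparison behaves like a join on the key column, and the function names and sample keys (`chr1:100` and so on) are made up for the example. When a key is repeated in both inputs, every copy on one side pairs with every copy on the other, so the output can be larger than either input. Counting keys (roughly what grouping on the column with a count would show) reveals the duplicates, and a set intersection gives each shared key once.

```python
from collections import Counter


def join_on_key(left, right):
    """Join-style comparison (assumed behavior, for illustration only):
    emit one output row for every matching (left, right) pair."""
    right_counts = Counter(right)
    out = []
    for key in left:
        # A left key repeated n times on the right yields n output rows.
        out.extend([key] * right_counts[key])
    return out


def duplicated_keys(keys):
    """Return each key that occurs more than once, with its count."""
    return {k: n for k, n in Counter(keys).items() if n > 1}


# Hypothetical "chromosome:start" keys; note the repeated chr1:100.
a = ["chr1:100", "chr1:100", "chr2:500"]
b = ["chr1:100", "chr1:100", "chr1:100", "chr3:900"]

joined = join_on_key(a, b)
print(len(a), len(b), len(joined))  # 3 4 6 -> output is larger than both inputs

print(duplicated_keys(a))  # {'chr1:100': 2}

# Set intersection keeps each shared key exactly once.
common = sorted(set(a) & set(b))
print(common)  # ['chr1:100']
```

If duplicates turn up, removing them from the key column before each comparison should keep the result from growing.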

Best, Jen, Galaxy team

3.6 years ago by
justinpadinske0 wrote:

Jennifer, you are a wonderful person. It worked! Thank you so much.
