Question: Move repeating sections of a column into new columns
0
gravatar for richard.radcliffe
15 months ago by
United States
richard.radcliffe0 wrote:

Hello,

I have a large dataset with several columns with recurring identifiers in the first two columns. I'd like to move each recurring section into a new column.

For example:

Input file

1   ID-1   A   
2   ID-1   B
3   ID-1   C
4   ID-1   D
5   ID-1   E
1   ID-2   F
2   ID-2   G
3   ID-2   H
4   ID-2   I
5   ID-2   J

Output file

1   ID-1   A    ID-2     F
2   ID-1   B    ID-2     G
3   ID-1   C    ID-2     H
4   ID-1   D    ID-2     I
5   ID-1   E    ID-2     J

I can do this manually by using the filter tool to filter on column 2 (ID-1, ID-2) and then joining the resultant files, but this would be tedious, cumbersome, and altogether no fun since I would have to do this ~500 times.

Any ideas?

Thanks.

-Richard

text manipulation • 360 views
ADD COMMENTlink modified 15 months ago by Jennifer Hillman Jackson25k • written 15 months ago by richard.radcliffe0

are you open to creating or installing new tools? or using bash?

what type of solution are you looking for?

EDIT: I have a quick and dirty one-line bash solution. If your interested.

ADD REPLYlink modified 15 months ago • written 15 months ago by danielsauceda60
0
gravatar for Jennifer Hillman Jackson
15 months ago by
United States
Jennifer Hillman Jackson25k wrote:

Hello,

This is a non-standard format, but you shouldn't have to filter and join each separately. For this solution, I am ignoring the first column of line numbers as I am not sure if those are actually in your dataset or not.

The IDs are a match without the -1 and -2, correct? If so, try this (3 steps):

  1. Split the identifier to isolate the common portion of the ID with Convert delimiters to TAB using the option to convert "dashes" to a tab.

This:

IDONE-1   A   
IDTWO-1   B
IDONE-2   F
IDTWO-2   G

becomes:

IDONE  1   A   
IDTWO  1   B
IDONE  2   F
IDTWO  2   G
  1. Create a dataset that contains just the 1 lines and another the 2 lines with Filter data on any column using simple expressions. Filtering on second column, run once for "c2==1" and run once for "c2==2".

The above becomes two datasets.

first:

IDONE  1   A   
IDTWO  1   B

second:

IDONE  2   F
IDTWO  2   G
  1. Then there are two options.

If the columns are ordered exactly the same, use Paste two files side by side.

The above then becomes:

IDONE  1   A  IDONE  2   F
IDTWO  1   B  IDTWO  2   G

If not or you are not sure, use Join the two datasets together with Join two files. The common field is the first field containing the ID.

The above then becomes:

IDONE  1   A  2   F
IDTWO  1   B  2   G

More could be done to reconstruct the original ID names, retain/remove/reorder columns, and similar operations. These could also go into your workflow. See the tools in the group Text Manipulation.

Thanks, Jen, Galaxy team

ADD COMMENTlink modified 15 months ago • written 15 months ago by Jennifer Hillman Jackson25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 178 users visited in the last hour