Move repeating sections of a column into new columns

Question: Move repeating sections of a column into new columns

15 months ago by

United States

Hello,

I have a large dataset with several columns with recurring identifiers in the first two columns. I'd like to move each recurring section into a new column.

For example:

Input file

1   ID-1   A   
2   ID-1   B
3   ID-1   C
4   ID-1   D
5   ID-1   E
1   ID-2   F
2   ID-2   G
3   ID-2   H
4   ID-2   I
5   ID-2   J

Output file

1   ID-1   A    ID-2     F
2   ID-1   B    ID-2     G
3   ID-1   C    ID-2     H
4   ID-1   D    ID-2     I
5   ID-1   E    ID-2     J

I can do this manually by using the filter tool to filter on column 2 (ID-1, ID-2) and then joining the resultant files, but this would be tedious, cumbersome, and altogether no fun since I would have to do this ~500 times.

Any ideas?

Thanks.

-Richard

text manipulation • 360 views

ADD COMMENT • link •

modified 15 months ago by Jennifer Hillman Jackson ♦ 25k • written 15 months ago by richard.radcliffe • 0

are you open to creating or installing new tools? or using bash?

what type of solution are you looking for?

EDIT: I have a quick and dirty one-line bash solution. If your interested.

ADD REPLY • link modified 15 months ago • written 15 months ago by danielsauceda • 60

15 months ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

This is a non-standard format, but you shouldn't have to filter and join each separately. For this solution, I am ignoring the first column of line numbers as I am not sure if those are actually in your dataset or not.

The IDs are a match without the -1 and -2, correct? If so, try this (3 steps):

Split the identifier to isolate the common portion of the ID with Convert delimiters to TAB using the option to convert "dashes" to a tab.

This:

IDONE-1   A   
IDTWO-1   B
IDONE-2   F
IDTWO-2   G

becomes:

IDONE  1   A   
IDTWO  1   B
IDONE  2   F
IDTWO  2   G

Create a dataset that contains just the 1 lines and another the 2 lines with Filter data on any column using simple expressions. Filtering on second column, run once for "c2==1" and run once for "c2==2".

The above becomes two datasets.

first:

IDONE  1   A   
IDTWO  1   B

second:

IDONE  2   F
IDTWO  2   G

Then there are two options.

If the columns are ordered exactly the same, use Paste two files side by side.

The above then becomes:

IDONE  1   A  IDONE  2   F
IDTWO  1   B  IDTWO  2   G

If not or you are not sure, use Join the two datasets together with Join two files. The common field is the first field containing the ID.

The above then becomes:

IDONE  1   A  2   F
IDTWO  1   B  2   G

More could be done to reconstruct the original ID names, retain/remove/reorder columns, and similar operations. These could also go into your workflow. See the tools in the group Text Manipulation.

Thanks, Jen, Galaxy team

ADD COMMENT • link modified 15 months ago • written 15 months ago by Jennifer Hillman Jackson ♦ 25k

Similar posts • Search »