problem with 'filter' tool in main galaxy with split command


azf127 • 20 (United States) wrote 4.2 years ago:

I have files in which one column holds a comma-separated list, and I want to filter lines by the number of members in that list:

6,6,6,6,6,6,6,6,6,6

6,6,6

I want to remove all lines in which the number of members is less than 5. I decided to use the 'filter' tool on Galaxy Main with:

len(c1.split(',')) < 5

However, the tool skips my data.

######################

Filtering with len(c1.split(',')) < 5 , kept 0.00% of 2 valid lines (2 total lines). Skipped 2 invalid line(s) starting at line #1: "6,6,6,6,6,6,6,6,6,6"

###################

I thought it was a column issue, but it seems that it is not. For example, I can use

'''

Filtering with len(c1)==5, kept 50.00% of 2 valid lines (2 total lines).

'''

Any idea where the issue comes from?

Thank you


modified 4.2 years ago by Jennifer Hillman Jackson ♦ 25k • written 4.2 years ago by azf127 • 20

Jennifer Hillman Jackson ♦ 25k (United States) wrote 4.2 years ago:

Hello,

Try the Select tool with:

Matching

^(\w+,){4}\w+$
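As a rough check of what this pattern selects, here is a minimal sketch using Python's `re` module. It assumes the Select tool's regular expression engine treats this pattern the same way Python does, which may not hold exactly. The `five_or_more` variant is a hypothetical addition, not part of the answer, shown only to contrast "exactly five" with "five or more".

```python
import re

# Pattern from the answer: four "member," groups plus a final member,
# i.e. exactly five comma-separated members on the line.
exactly_five = re.compile(r"^(\w+,){4}\w+$")

# Hypothetical variant (an assumption, not from the answer):
# four or more "member," groups, i.e. five or more members.
five_or_more = re.compile(r"^(\w+,){4,}\w+$")

lines = ["6,6,6,6,6,6,6,6,6,6", "6,6,6,6,6", "6,6,6"]

matches_exact = [line for line in lines if exactly_five.search(line)]
matches_min = [line for line in lines if five_or_more.search(line)]

print(matches_exact)
print(matches_min)
```

Under these assumptions, the posted pattern keeps only the five-member line, while the `{4,}` variant also keeps the ten-member line.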

Hopefully this helps, Jen, Galaxy team


Thanks, Jennifer. However, it doesn't work for my case. Also, if it did work, what should I do if there is more than one column?
Thanks

written 4.2 years ago by azf127 • 20

Oh, you want to remove lines with fewer than five, not keep five or fewer. Then just switch to "not matching".

Or run it and then use "Compare two Datasets" to find lines not in common, keeping lines of the original.
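The "lines not in common" idea can be sketched in plain Python as a set difference between the original lines and the selected lines. This only illustrates the logic and does not show how the Galaxy tool is implemented; the variable names are hypothetical.

```python
import re

# Pattern from the earlier answer (exactly five members).
pattern = re.compile(r"^(\w+,){4}\w+$")

original = ["6,6,6,6,6,6,6,6,6,6", "6,6,6,6,6", "6,6,6"]

# Step 1: the lines a "Matching" selection would return.
selected = [line for line in original if pattern.search(line)]

# Step 2: lines of the original that are not in the selection,
# kept in their original order.
selected_set = set(selected)
not_in_common = [line for line in original if line not in selected_set]

print(not_in_common)
```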

Regular expressions have endless possibilities. Help is at the bottom of the tool form, and some experimentation is almost always needed. If you have more than one column, just pad either end with a greedy expression, like ".*". Make sure the pattern is bounded by a tab/whitespace expression on each side, to preserve it as a distinct column. Something like:

^.*\s(\w+,){4}\w+\s.*$
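A small sketch of how the padded expression behaves on tab-separated lines, written here with ".*" at both ends as the surrounding text describes. It uses Python's `re` module as a stand-in, on the assumption that the Select tool interprets the pattern similarly. The sample rows are made up for illustration.

```python
import re

# Greedy padding on both sides, with whitespace bounding the target column.
multi_column = re.compile(r"^.*\s(\w+,){4}\w+\s.*$")

# Hypothetical rows: an ID column, the comma-separated column, a trailing column.
rows = [
    "id1\t6,6,6,6,6\tx",
    "id2\t6,6,6\ty",
    "id3\t6,6,6,6,6,6,6,6,6,6\tz",
]

matched = [row for row in rows if multi_column.search(row)]
print(matched)
```

Note that, as written, the pattern expects whitespace on both sides of the target column, so a column at the very start or end of the line would need the expression adjusted.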

If you have other columns with comma-separated data, this won't work, but you could always do something like this: use "Add column" and iterate to add line numbers, Cut out just the column you want to work with plus the line number column, filter that, then join the results back on the line number using the tool "Join two Datasets".
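The add-line-numbers, cut, filter, and join sequence can be sketched in plain Python as below. This mirrors the logic of the steps only; it is not how the Galaxy tools are implemented. The function name `filter_by_member_count`, its parameters, and the sample rows are hypothetical.

```python
def filter_by_member_count(lines, column, min_members=5):
    """Keep tab-separated lines whose chosen column (0-based index)
    holds at least `min_members` comma-separated values."""
    # Step 1 ("Add column"): attach a line number to every line.
    numbered = list(enumerate(lines, start=1))

    # Step 2 ("Cut"): keep only the line number and the target column.
    cut = [(number, line.split("\t")[column]) for number, line in numbered]

    # Step 3 (filter): keep line numbers whose column has enough members.
    kept = {number for number, value in cut
            if len(value.split(",")) >= min_members}

    # Step 4 ("Join two Datasets"): join back to the full lines by line number.
    return [line for number, line in numbered if number in kept]


# Hypothetical rows where another column also holds comma-separated data.
rows = [
    "geneA\t6,6,6,6,6,6,6,6,6,6\t1,2",
    "geneB\t6,6,6\t1,2,3,4,5,6",
]

result = filter_by_member_count(rows, column=1)
print(result)
```

Because the count is taken on one isolated column, comma-separated values in other columns do not affect the result.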

Once you work out a path, save it as a workflow in case you want to do it again without the steps being so tedious.

Good luck! There should be a solution in here for any dataset I can think of. Jen, Galaxy team

modified 4.2 years ago • written 4.2 years ago by Jennifer Hillman Jackson ♦ 25k
