Question: problem with 'filter' tool in main galaxy with split command
azf127 (United States) wrote, 4.2 years ago:

I have files in which one column holds a comma-separated list, and I want to filter by the number of members in that list:

6,6,6,6,6,6,6,6,6,6

6,6,6

I want to remove all lines where the number of members is less than 5. I decided to use the 'Filter' tool on Galaxy Main with this condition:

 len(c1.split(',')) < 5

However, the tool skips my data.

######################

Filtering with len(c1.split(',')) < 5 , kept 0.00% of 2 valid lines (2 total lines). Skipped 2 invalid line(s) starting at line #1: "6,6,6,6,6,6,6,6,6,6"

###################

I thought it was a column issue, but it seems that it is not. For example, this condition works:

'''

Filtering with len(c1)==5, kept 50.00% of 2 valid lines (2 total lines).

'''

Any idea where the issue comes from?

Thank you

 

Jennifer Hillman Jackson (United States) wrote, 4.2 years ago:

Hello,

Try the Select tool with:

       Matching

       ^(\w+,){4}\w+$

Hopefully this helps, Jen, Galaxy team
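
To see what this pattern accepts, here is a minimal sketch using Python's `re` module, which may differ in small details from the regex engine behind the Select tool. The third sample line and the `{4,}` variant are assumptions added for illustration; they are not part of the answer above.

```python
import re

# Pattern from the answer: four "member," groups followed by one final member,
# i.e. exactly five comma-separated members on the line.
EXACTLY_FIVE = re.compile(r"^(\w+,){4}\w+$")

# Assumed variation (not in the original answer): {4,} allows four or more
# "member," groups, i.e. five or more members.
FIVE_OR_MORE = re.compile(r"^(\w+,){4,}\w+$")

# The first two lines come from the question; the third is a hypothetical
# five-member line added so that the exact-five pattern has something to match.
lines = ["6,6,6,6,6,6,6,6,6,6", "6,6,6", "6,6,6,6,6"]

exactly_five = [line for line in lines if EXACTLY_FIVE.search(line)]
five_or_more = [line for line in lines if FIVE_OR_MORE.search(line)]

print(exactly_five)
print(five_or_more)
```

Inverting the exact-five pattern keeps every line whose count is not five, including the shorter ones, so a "five or more" variant along these lines may be closer to the goal stated in the question.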


Thanks, Jennifer. However, it doesn't work for my case. Also, if it did work, what should I do if there is more than one column?
Thanks

(reply by azf127, 4.2 years ago)

Oh, you want to remove fewer than five, not keep five or less. Then just switch to "not matching".

Or run it as is, then use "Compare two Datasets" to find the lines not in common, keeping the lines of the original.

Regular expressions have endless possibilities. Help is at the bottom of the tool form, and some experimentation is almost always needed. If you have more than one column, just pad either end with a greedy expression, like ".*". Make sure the list is bounded by a tab/whitespace expression, to preserve it as a distinct column. Something like:

^.*\s(\w+,){4}\w+\s.*$
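
As a rough check of the multi-column form, the sketch below runs it in Python's `re` module with ".*" as the padding on both sides, as the text describes. The rows are hypothetical tab-separated data, and the Select tool's own engine may behave slightly differently.

```python
import re

# Multi-column pattern: ".*" pads both sides, and \s marks the boundaries
# of the comma-separated column.
PATTERN = re.compile(r"^.*\s(\w+,){4}\w+\s.*$")

# Hypothetical tab-separated rows; the list column is the middle one.
rows = [
    "geneA\t6,6,6,6,6\tchr1",      # five members
    "geneB\t6,6,6\tchr2",          # three members
    "geneC\t6,6,6,6,6,6,6\tchr3",  # seven members
]

matching = [row for row in rows if PATTERN.search(row)]
not_matching = [row for row in rows if not PATTERN.search(row)]

print(matching)
print(not_matching)
```

Because the pattern requires whitespace on both sides of the list, it only matches when the list column sits between two other columns; a first or last column would need the corresponding `\s` and padding dropped.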

If you have other columns with comma-separated data, this won't work, but you could always do something like this: use "Add column" with iterate to add line numbers, Cut out just the column you want to work with plus the line number column, filter that, then join the results back to the original based on the line number using the tool "Join two Datasets".
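
The underlying idea (look at one column only, test it, keep the whole row) can be sketched outside Galaxy in plain Python. This is only an illustration of the logic, not how the Galaxy tools are implemented; the function name `keep_rows_with_min_members` and the sample rows are hypothetical.

```python
def keep_rows_with_min_members(rows, column_index, min_members, sep=","):
    """Keep tab-separated rows whose chosen column has at least min_members items."""
    kept = []
    for row in rows:
        # Split the row into columns, then split only the chosen column.
        fields = row.rstrip("\n").split("\t")
        members = fields[column_index].split(sep)
        if len(members) >= min_members:
            kept.append(row)
    return kept


# Hypothetical rows: both columns contain commas, but only the second
# column (index 1) is counted.
rows = [
    "1,2,3\t6,6,6,6,6,6,6,6,6,6",
    "4,5,6,7,8,9\t6,6,6",
]

kept = keep_rows_with_min_members(rows, column_index=1, min_members=5)
print(kept)
```

Counting within a single column avoids the problem of commas in other columns, which is what the cut, filter, and join steps achieve inside Galaxy.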

Once you work out a path, save it as a workflow in case you want to do it again without the steps being so tedious.

Good luck! There should be a solution in here for any dataset I can think of. Jen, Galaxy team

(reply by Jennifer Hillman Jackson, 4.2 years ago)