Select lines in Galaxy (Filter-sort function)

3.1 years ago by

United Kingdom

I am trying to select the lines whose 4th column starts with hsa, but I can't seem to get any results. What expression do I have to use for that? I have been using c4='hsa', but it doesn't work.

galaxy • 880 views


modified 3.1 years ago by Bjoern Gruening ♦ 5.1k • written 3.1 years ago by rafaela.michailidou • 0

3.1 years ago by

Bjoern Gruening ♦ 5.1k

Germany

Bjoern Gruening ♦ 5.1k wrote:

What always comes in handy is the Python magic of the "Filter data on any column using simple expressions" tool. You can use this as the filter criterion: c4.startswith('hsa')
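To illustrate why c4.startswith('hsa') keeps rows that an exact comparison drops, here is a minimal Python sketch that mimics the filter outside Galaxy. The sample rows and the helper name filter_rows are made up for illustration, and this is not the Filter tool's actual implementation, only the same string test applied to tab-separated lines.

```python
# Hypothetical tab-separated rows; the 4th column holds the identifier.
rows = [
    "chr1\t100\t200\thsa-miR-21\t+",
    "chr1\t300\t400\tmmu-miR-21\t+",
    "chr2\t500\t600\thsa\t-",
]


def filter_rows(lines, column=4, prefix="hsa"):
    """Keep lines whose 1-based `column` starts with `prefix`."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip rows that are too short to have the requested column.
        if len(fields) >= column and fields[column - 1].startswith(prefix):
            kept.append(line)
    return kept


# Prefix test: matches "hsa-miR-21" and "hsa".
starts_with = filter_rows(rows)

# Exact test: matches only the row whose 4th column is exactly "hsa".
exact = [line for line in rows if line.split("\t")[3] == "hsa"]

print(len(starts_with), len(exact))
```

The prefix test keeps two of the three sample rows, while the exact comparison keeps only the row whose 4th column is precisely hsa.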

I have added this use case to the Galaxy-Tricks repository. Maybe you will find this useful:

https://github.com/bgruening/galaxy-tricks/commit/bc3e4fa2ab01a1468fbed0d3219d1573d366743c

Cheers,

Bjoern

written 3.1 years ago by Bjoern Gruening ♦ 5.1k

Nice! Even I didn't know that py function would work. This could be considered as an example directly on the tool form (along with others that are known to function).

Love that Galaxy-Tricks repo :) Jen

written 3.1 years ago by Jennifer Hillman Jackson ♦ 25k

Glad you like it. Feel free to contribute :)

written 3.1 years ago by Bjoern Gruening ♦ 5.1k

On my "list"! :)

written 3.1 years ago by Jennifer Hillman Jackson ♦ 25k

3.1 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

With an equality test (written with a double equals sign, as in c4=='hsa'), the contents of the data field must match exactly the text between the two quotes in order for the Filter tool to isolate rows from the input.

The Select tool could be a better choice if the field contains other data. An expression like this will find rows whose 4th column starts with hsa (but may have more content). It is important to know the exact number of columns; otherwise "greedy" expressions such as ".*" will be imprecise.

^.*\t.*\t.*\thsa.*\t[add in more .*\t expressions until you reach the last column, then use this to capture the last column].*$

breakdown:

^ = start of line

.* = zero or more characters, can capture nearly any content, including tabs (greedy)

\t = must be a tab

hsa.* = specifies a value that starts with hsa. When it is bounded by tabs and all other fields in the row are also bounded by tabs, this can isolate and filter on the 4th column

.*$ = designates the last column, which can have any content (without a tab); when it begins after a tab and ends with a $, it isolates the last field. ($ alone is always the end of the line)

More help on regular expressions is available on the tool form and in many places online. Expressions can be simple or complicated, and a few tests are sometimes needed to tune one to do exactly what you want with the given data. There are often several ways to build an expression; this is a simple way to do the one you want.
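As a quick way to run the kind of test mentioned above, the following sketch compares the greedy pattern with a stricter variant. It uses Python's re module, which is an assumption made for illustration; the Select tool's regular expression dialect may differ in details. The sample rows are made up, and the stricter pattern with [^\t]* is one alternative, not the expression from the answer.

```python
import re

# Hypothetical rows: the second has "hsa" at the start of column 5, not column 4.
rows = [
    "chr1\t100\t200\thsa-miR-21\t+",
    "chr1\t100\t200\tmmu-miR-21\thsa-like",
]

# Greedy form: ".*" can also swallow tabs, so the column position is not fixed.
greedy = re.compile(r"^.*\t.*\t.*\thsa.*$")

# Stricter form: "[^\t]*" cannot cross a tab, so exactly three fields
# are skipped before "hsa" must appear at the start of the 4th one.
strict = re.compile(r"^(?:[^\t]*\t){3}hsa")

greedy_hits = [line for line in rows if greedy.search(line)]
strict_hits = [line for line in rows if strict.search(line)]

print(len(greedy_hits), len(strict_hits))
```

On this sample the greedy pattern also accepts the second row, because its leading .* absorbs an extra column, while the stricter pattern accepts only the first row. This is the imprecision that makes knowing the exact number of columns important.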

Hope this helps!

Jen, Galaxy team

written 3.1 years ago by Jennifer Hillman Jackson ♦ 25k
