Question: How to filter out sequences containing non-[ATCG] characters?
师云 wrote:
Hi Jen,

As the title says, I have a [fasta] file that was obtained from a [gtf] file:

atcgtaaagggcgat
gtcgttgactNNNNNNNNgtc

I want to filter out the sequences that contain any non-[ATCG] character, so that the output looks like this:

atcgtaaagggcgat

I have a large number of sequences to filter. My idea is to first convert the file to an [interval] file, and then use SELECT to keep the lines not matching the pattern /\t[ATCGatcg]*[^ATCGatcg]/. Am I right? Or is there a one-step way?
modified 5.0 years ago by Jennifer Hillman Jackson • written 5.0 years ago by 师云
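The first step described in the question (turning FASTA records into one tab-separated line each) can be sketched outside Galaxy as follows. This is only an illustration in Python, not the Galaxy conversion tool itself; the function name `fasta_to_tabular` and the record names `seq1` and `seq2` are made up for the example, and the sequences are the two from the question.

```python
from typing import Iterable, Iterator


def fasta_to_tabular(lines: Iterable[str]) -> Iterator[str]:
    """Yield one 'header<TAB>sequence' line per FASTA record."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield f"{header}\t{''.join(chunks)}"
            header = line[1:]
            chunks = []
        elif header is not None:
            # A sequence may be wrapped over several lines; collect the pieces.
            chunks.append(line)
    if header is not None:
        yield f"{header}\t{''.join(chunks)}"


# Hypothetical input: record names are invented, sequences are from the question.
fasta = """>seq1
atcgtaaa
gggcgat
>seq2
gtcgttgactNNNNNNNNgtc
""".splitlines()

for row in fasta_to_tabular(fasta):
    print(row)
```

Each output line has the identifier in the first column and the full sequence in the second, which is the layout the `\t` in the pattern relies on.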
Jennifer Hillman Jackson wrote:
Hello,

If the data was in .fastqsanger format, you could use the tool "Manipulate FASTQ", but with .fasta, this is a good way.

But watch your regular expression: test it out on a smaller set to make sure it is doing what you want. I see a "start of the line" character in the middle of your expression ("^"). I see why it could be working, with the prior expression being zero or more (*), but knowing what each character does is generally a good idea. The help on the tool is good, as are many web sites, but this is simple. Also, you don't need the // slashes, just enter the expression.

To get you started, I would use something like this, with the Select tool and "Matching":

^..*\t[ATCGatcg]+$

(Only one dot is really required; this is just how I always do it. It adds a bit of a format sanity check into the filter.)

Hope this helps!

Jen
Galaxy team

--
Jennifer Hillman-Jackson
http://galaxyproject.org
written 5.0 years ago by Jennifer Hillman Jackson
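To try the suggested expression on a small set before running it on the full data, something like the following Python sketch could be used. It assumes tab-separated rows of identifier and sequence; the row contents are hypothetical, and Python's `re` module may differ in small details from the regex engine behind the Select tool.

```python
import re

# The expression from the answer: a non-empty first column, a tab, then
# only A/T/C/G (either case) through to the end of the line.
ONLY_ATCG = re.compile(r"^..*\t[ATCGatcg]+$")

# Hypothetical tabular rows: identifier, tab, sequence.
rows = [
    "seq1\tatcgtaaagggcgat",
    "seq2\tgtcgttgactNNNNNNNNgtc",
    "seq3\t",  # empty sequence column
]

# Equivalent in spirit to Select with "Matching": keep rows that match.
kept = [row for row in rows if ONLY_ATCG.search(row)]

for row in kept:
    print(row)
```

Only the `seq1` row is printed: the `seq2` row is rejected because of the run of N, and the `seq3` row is rejected because `+` requires at least one base after the tab.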
Hi,

It does help. Your regular expression looks briefer and more useful. BTW, when a caret (^) is the first character inside [], as in [^ATCGatcg], it means any character that is not in [ATCGatcg], though this may not work in the SELECT tool.

Thank you for your help!

(In reply to Jennifer Hillman Jackson's message of Mon, 9 Dec 2013 06:34:28 -0800, subject "Re: [galaxy-user] How to filter the sequences containing not[ATCG] character?")

___________________________________________________________
The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using "reply all" in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list: http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
written 5.0 years ago by 师云
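The two meanings of `^` discussed above, and the relationship between the two patterns in this thread, can be checked with a short Python sketch. The rows are hypothetical, and the behaviour shown is that of Python's `re` module, which is assumed (not verified here) to agree with the Select tool on these points.

```python
import re

# Outside brackets, ^ anchors the match at the start of the line.
starts_with_a = re.compile(r"^a")
# As the first character inside brackets, ^ negates the character class.
non_atcg = re.compile(r"[^ATCGatcg]")

print(bool(starts_with_a.search("atcg")))    # True
print(non_atcg.findall("gtcgttgactNNgtc"))   # ['N', 'N']

# The pattern from the question, used with "not matching" ...
asker = re.compile(r"\t[ATCGatcg]*[^ATCGatcg]")
# ... and the pattern from the answer, used with "Matching".
jen = re.compile(r"^..*\t[ATCGatcg]+$")

# Hypothetical tabular rows: identifier, tab, sequence.
rows = ["seq1\tatcgtaaagggcgat", "seq2\tgtcgttgactNNNNNNNNgtc"]
kept_by_asker = [r for r in rows if not asker.search(r)]
kept_by_jen = [r for r in rows if jen.search(r)]

print(kept_by_asker == kept_by_jen)          # True
```

On rows like these the two filters keep the same lines. They can differ on edge cases: a row with an empty sequence column passes the "not matching" filter but fails the "Matching" one, because `+` requires at least one base.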
Hello,

You are right! I forgot about that. Aren't regular expressions fun?

Please test it out if you prefer your method or are just curious; I didn't try it that way. There are usually a few ways to do the same thing when using a regex. I am glad that this helped a bit, and good luck with the query.

Jen
Galaxy team

--
Jennifer Hillman-Jackson
http://galaxyproject.org
written 5.0 years ago by Jennifer Hillman Jackson
Hello,

Yeah, it's interesting! I have tried it, and something like "[^ATCGatcg]" is useful. I have a large file to deal with, so I will look for an efficient regular expression.

Thank you.

(In reply to Jennifer Hillman Jackson's message of Mon, 9 Dec 2013 07:24:46 -0800. The original question was sent to the list on 12/8/13 at 6:21 PM.)
written 5.0 years ago by 师云
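For a large file, one possible "one-step" alternative outside Galaxy is to stream the FASTA records and test each sequence with the negated class directly. The sketch below is an assumption-laden illustration, not a Galaxy tool: the names `read_fasta` and `filter_atcg` and the record names are invented, and empty sequences are dropped by choice.

```python
import re
import sys
from typing import Iterable, Iterator, Tuple

# Any single character that is not A/T/C/G in either case.
NON_ATCG = re.compile(r"[^ATCGatcg]")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, holding one record in memory at a time."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)  # wrapped sequence lines are joined
    if header is not None:
        yield header, "".join(chunks)


def filter_atcg(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA text for records whose sequence is non-empty and pure A/T/C/G."""
    for header, seq in read_fasta(lines):
        # search() returns at the first offending character it finds.
        if seq and NON_ATCG.search(seq) is None:
            yield f">{header}\n{seq}\n"


# Hypothetical records using the two sequences from the question.
example = [">seq1", "atcgtaaagggcgat", ">seq2", "gtcgttgactNNNNNNNNgtc"]
sys.stdout.writelines(filter_atcg(example))
```

Because `filter_atcg` accepts any iterable of lines, an open file handle can be passed in place of `example`, so the whole file never has to be loaded at once.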