Question: Filter Fastq By Percentage Of Ambiguous (N) Bases
Hello, I would like to filter my FASTQ files (50 bp single-end Illumina RNA-seq reads) by a maximum threshold (10%) of ambiguous (N) bases. I can see that the "CLIP" tool removes all reads with one or more N bases. Is there a way to remove only the reads with five or more N bases using Galaxy? Thank you. Best wishes, Anto
galaxy • 2.6k views
modified 5.3 years ago by Jennifer Hillman Jackson • written 5.3 years ago by Anto Praveen Rajkumar Rajamani
Jennifer Hillman Jackson wrote:
Hello Anto,
There is no specific tool that I know of to do this based on read content, but you could use the very low quality score (2) assigned to ambiguous bases and the tool 'Filter by quality' to filter by percentage. Be aware that other bases may also be assigned this low score, but these are very unlikely to be of practical use anyway. You could clip these ends first, then do the filter, discarding any reads that have very short usable sequence left. If the data is Illumina, this is likely a sign of a sequence that failed the vendor quality checks, and these are no longer removed by default as of CASAVA 1.8+.
Creating a regular expression with the Select tool is another option, but this is probably more effort than it is worth to construct. But, your choice. A Google search will bring up syntax advice.
Ideally the first will do the job,
Jen, Galaxy team
-- Jennifer Hillman-Jackson, Galaxy Support and Training
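For readers who can run a script outside Galaxy, the content-based filter asked about in the question can be sketched in a few lines of Python. This is only a minimal illustration, not a Galaxy tool: the function name `filter_fastq_by_n` and its `n_fraction_cutoff` parameter are made up for this example, and it assumes plain four-line FASTQ records (no wrapped sequence lines). A read is dropped when its N fraction is at or above the cutoff, so with 50 bp reads and a 10% cutoff that means five or more N bases.

```python
import io


def filter_fastq_by_n(in_handle, out_handle, n_fraction_cutoff=0.10):
    """Copy 4-line FASTQ records whose N fraction is below the cutoff.

    Hypothetical helper for illustration; returns (kept, dropped) counts.
    """
    kept = dropped = 0
    while True:
        header = in_handle.readline()
        if not header:  # end of input
            break
        seq = in_handle.readline()
        plus = in_handle.readline()
        qual = in_handle.readline()
        bases = seq.strip().upper()
        # Empty reads are dropped; otherwise compare the N fraction to the cutoff.
        if bases and bases.count("N") / len(bases) < n_fraction_cutoff:
            out_handle.write(header + seq + plus + qual)
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Tiny in-memory demonstration ('#' is Phred 2 in Phred+33 encoding).
demo = io.StringIO(
    "@read1\nACGTNACGTA\n+\nIIII#IIIII\n"   # 1 N in 10 bases = 10%, dropped
    "@read2\nACGTACGTAC\n+\nIIIIIIIIII\n"   # no N bases, kept
)
kept_out = io.StringIO()
print(filter_fastq_by_n(demo, kept_out))  # (1, 1)
```

For real files, the same function can be called with handles from `open("reads.fastq")` and `open("filtered.fastq", "w")` (both file names are placeholders).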
written 5.3 years ago by Jennifer Hillman Jackson