I would like to filter my FASTQ files (50 bp single-end Illumina RNA-seq
reads) by a maximum threshold (10%) of ambiguous (N) bases.
I can see that the "CLIP" tool removes all reads with one or more N
bases. Is there a way to remove only the reads with five or more N bases?
There is no specific tool that I know of to do this based on read
content, but you could use the very low quality score (2) assigned to
ambiguous bases and the tool 'Filter by quality' to filter by
percentage. Be aware that other bases may also be assigned this
low score, but these would very likely not be of practical use.
You could clip these ends first, then do the filter, discarding any
reads that have very short usable sequence left. If the data is Illumina,
this is a sign of a sequence that failed vendor quality checks, and these
are no longer removed by default as of Casava 1.8+.
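
As a rough illustration of the percentage filter outside Galaxy, here is a
minimal Python sketch. It is not the 'Filter by quality' tool's own code. The
function names, the Phred+33 offset, and the 10% limit are assumptions made
for the example; older Illumina data may use a Phred+64 offset instead.

```python
def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ records."""
    records = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(records) - 3, 4):
        yield tuple(records[i:i + 4])


def low_quality_fraction(quality, cutoff=2, offset=33):
    """Return the fraction of bases whose Phred score is <= cutoff.

    offset=33 is an assumption (Sanger / Casava 1.8+ encoding).
    """
    if not quality:
        return 1.0  # treat an empty read as entirely unusable
    low = sum(1 for ch in quality if ord(ch) - offset <= cutoff)
    return low / len(quality)


def filter_by_quality(records, max_fraction=0.10, cutoff=2, offset=33):
    """Keep records whose low-quality fraction is strictly below max_fraction."""
    return [
        rec for rec in records
        if low_quality_fraction(rec[3], cutoff, offset) < max_fraction
    ]
```

With 50 bp reads and max_fraction=0.10, a read with five or more bases at
score 2 is dropped and a read with four or fewer is kept. To try it, pass an
open file handle to read_fastq and the resulting records to filter_by_quality.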
Creating a regular expression with the Select tool is another option,
but this is probably more effort than it is worth to construct. But it is
your choice. A web search will bring up syntax advice.
Ideally the first option will do the job.
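
For the regular-expression route, the pattern below is one possible way to
match a sequence containing five or more N bases. It is written for Python's
re module; the Select tool's regex dialect may differ, and the helper name is
hypothetical. It should be applied to the sequence line only, not to the
header or quality lines.

```python
import re

# An N followed by any run of non-N bases, repeated five times:
# this matches any sequence that contains at least five N bases.
FIVE_OR_MORE_N = re.compile(r"(?:N[^N]*){5}")


def has_five_or_more_n(sequence):
    """Return True if the sequence contains five or more N bases."""
    return FIVE_OR_MORE_N.search(sequence) is not None
```

In plain Python, sequence.count("N") >= 5 gives the same answer with less
effort; the pattern is shown only because a regex is what a pattern-matching
tool would need.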
Galaxy Support and Training