Question: Select FASTQ reads by sequence
23 months ago by
PaulW60
PaulW60 wrote:

Which tool should I use to select all reads from a FASTQ file which include any of about 100 short sequences given in a file such as:

ACAGTCAGCTAGCATCGATCCTAGCTAGAC
GCATCACGACTACGACGTACATCTAGCATG
etc.

Is there a tool in Galaxy which will do this?

Alternatively would BBDUK work for this?

fastq • 927 views
modified 23 months ago by Jennifer Hillman Jackson25k • written 23 months ago by PaulW60

BTW, the FASTQ reads are 150-base Illumina reads.

written 23 months ago by PaulW60
23 months ago by
United States
Jennifer Hillman Jackson25k wrote:

Hello,

The tool Select can pattern match, one pattern at a time, through a regular expression. But that won't be the best solution for this query.

Instead, try a mapping tool such as Lastz. Create a custom reference genome of the 100 query sequences, map the fastq dataset, then filter the output by percent identity and coverage.

Help: https://wiki.galaxyproject.org/Support#Custom_reference_genome
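If the mapping tool only accepts FASTA input, the reads would need converting before mapping. A minimal sketch of that conversion in plain Python, assuming standard 4-line FASTQ records (no wrapped sequence lines); the function name `fastq_to_fasta` is made up for illustration and is not part of Galaxy or Lastz:

```python
def fastq_to_fasta(fastq_handle, fasta_handle):
    """Write each 4-line FASTQ record as a 2-line FASTA record.

    Assumes unwrapped FASTQ (exactly 4 lines per read).
    Returns the number of reads written.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = fastq_handle.readline().strip()
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, dropped in FASTA
        # Replace the leading '@' of the FASTQ header with '>'.
        fasta_handle.write(">" + header[1:].strip() + "\n" + seq + "\n")
        count += 1
    return count


# Example usage (file names are placeholders):
# with open("reads.fastq") as fq, open("reads.fasta", "w") as fa:
#     fastq_to_fasta(fq, fa)
```

Quality scores are discarded, so keep the original FASTQ if the selected reads are needed in that format afterwards.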

I cannot help you with BBDUK, but perhaps someone else here can, or you can review other online sites (a Google search brings up a lot of usage discussion).

Thanks, Jen, Galaxy team

written 23 months ago by Jennifer Hillman Jackson25k

Jen, thanks for that interesting suggestion. Unfortunately, Lastz doesn't process FASTQ files. Worse, the Galaxy implementation of Lastz apparently doesn't expose the "--ambiguous=iupac" command-line option, so converting FASTQ to FASTA didn't work. I'll keep searching. There are a couple of things in the Toolshed that look like they might help.
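For anyone who can run a script outside Galaxy, here is a rough sketch of the selection described in the question. It is not a Galaxy tool and nothing from this thread: it does exact substring matching only (no mismatches or indels), assumes 4-line FASTQ records and a query file with whitespace-separated sequences, and by default also searches for the reverse complement of each query, which is my assumption about what is wanted. The function names are invented for illustration.

```python
# Translation table for complementing bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def load_queries(handle, both_strands=True):
    """Read whitespace-separated query sequences into a set.

    Every token in the file is treated as a sequence, so the file
    should contain nothing but sequences.
    """
    queries = set()
    for line in handle:
        for token in line.split():
            token = token.upper()
            queries.add(token)
            if both_strands:
                queries.add(reverse_complement(token))
    return queries


def select_reads(fastq_handle, queries):
    """Yield each 4-line FASTQ record whose sequence contains any query."""
    while True:
        record = [fastq_handle.readline() for _ in range(4)]
        if not record[0]:
            break  # end of file
        seq = record[1].strip().upper()
        if any(q in seq for q in queries):
            yield "".join(record)


# Example usage (file names are placeholders):
# with open("queries.txt") as qf:
#     queries = load_queries(qf)
# with open("reads.fastq") as fq, open("selected.fastq", "w") as out:
#     out.writelines(select_reads(fq, queries))
```

With about 100 queries and 150-base reads, the simple `any(...)` scan does up to a couple of hundred substring checks per read, which should be workable for moderate file sizes. If reads with sequencing errors in the matched region also need to be kept, a mapping or k-mer based approach is the better fit.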

ADD REPLYlink written 23 months ago by PaulW60