Question: Selecting Reads At Random From Fastq File
6.9 years ago by
Austin Paul140
Austin Paul140 wrote:
Hi, I am curious if anyone knows how to select random reads from a FASTQ file. There is a select random lines tool in the text manipulation tools, but it does not treat FASTQ files specially, so it will not keep quality lines together with their sequence lines. And if I convert the FASTQ file to tabular form in order to select lines, I can no longer return it to FASTQ form. Does anyone know a way to do this in Galaxy? Otherwise, perhaps with another program? Thanks. Austin
galaxy • 2.3k views
ADD COMMENTlink modified 6.9 years ago by Jennifer Hillman Jackson25k • written 6.9 years ago by Austin Paul140
6.9 years ago by
United States
Jennifer Hillman Jackson25k wrote:
Hello Austin, Your approach is the right way to do this entirely within Galaxy. For the final step, converting the tabular data back to FASTQ, use the tool "NGS: QC and manipulation -> Tabular to FASTQ converter". Hopefully this helps, Jen Galaxy team -- Jennifer Jackson http://usegalaxy.org http://galaxyproject.org/wiki/Support
ADD COMMENTlink written 6.9 years ago by Jennifer Hillman Jackson25k
6.9 years ago by
Peter Cock1.4k
European Union
Peter Cock1.4k wrote:
How big are your FASTQ files (can they be indexed in memory)? And are you willing to program? If you like Python, Biopython's Bio.SeqIO.index(...) or Bio.SeqIO.index_db(...) functions would let you do this easily. Have a look at the "Getting the raw data for a record" example in the tutorial, and please ask if you would like a little more help: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Regards, Peter
ADD COMMENTlink written 6.9 years ago by Peter Cock1.4k
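To illustrate the indexing idea in the post above, here is a minimal standard-library sketch. It is not Biopython's implementation: it simply records the byte offset of every record, picks offsets at random, and copies the raw records out. The function names `index_fastq` and `sample_fastq` are made up for this example, and it assumes a plain (uncompressed) FASTQ file with exactly four lines per record and no blank lines.

```python
import random


def index_fastq(path):
    """Return the byte offset at which each 4-line FASTQ record starts."""
    offsets = []
    with open(path, "rb") as handle:
        while True:
            start = handle.tell()
            lines = [handle.readline() for _ in range(4)]
            if not lines[0]:
                break  # clean end of file
            if not lines[3]:
                raise ValueError("truncated FASTQ record at byte %d" % start)
            offsets.append(start)
    return offsets


def sample_fastq(path, out_path, n, seed=None):
    """Copy n randomly chosen records from path to out_path (assumed helper)."""
    offsets = index_fastq(path)
    rng = random.Random(seed)
    # Sorting keeps the chosen records in their original file order
    # and makes the seeks move forward through the file.
    chosen = sorted(rng.sample(offsets, n))
    with open(path, "rb") as src, open(out_path, "wb") as dst:
        for start in chosen:
            src.seek(start)
            for _ in range(4):
                dst.write(src.readline())
    return len(chosen)
```

Calling `sample_fastq("reads.fastq", "subset.fastq", 1000, seed=1)` would write 1000 records, where both file names are placeholders. The in-memory offset list grows with the number of reads, which is the memory question raised above.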
Hi Peter, Thanks for the suggestion. For example, I have a fastq file with 50 million reads and I want to randomly select 5 million of them. It seems biopython would very easily select a single or a handful of reads with the Bio.SeqIO.index() function. Would it also be able to do the job I am interested in? Austin
ADD REPLYlink written 6.9 years ago by Austin Paul140
I think so, but you'd have to use Bio.SeqIO.index_db(), which stores the index in an SQLite database on disk rather than in memory. Holding the index in memory with Bio.SeqIO.index() isn't really viable here (unless you have a 64-bit big-memory machine?). I don't think I've tried it with quite that many reads though...

Alternatively, if I understood her correctly, Jennifer pointed out you can do this in Galaxy, but it will take a lot of IO:

1. Convert FASTQ to tabular (4 lines per record -> 1 line per record)
2. Randomly select lines (each line is now a record, so this is safe)
3. Convert tabular back to FASTQ

It should work though, and requires no additional programming. Peter
ADD REPLYlink written 6.9 years ago by Peter Cock1.4k
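For anyone who wants to see why the three-step route above is safe, here is a rough Python sketch of the same idea outside Galaxy. The column layout (name, sequence, quality) and the function names are assumptions for illustration and may not match what the Galaxy converters actually emit. It works on a list of lines held in memory, so it is only meant for small inputs.

```python
import random


def fastq_to_tabular(lines):
    """Collapse each 4-line FASTQ record into one tab-separated row."""
    lines = [line.rstrip("\n") for line in lines]
    if len(lines) % 4:
        raise ValueError("number of lines is not a multiple of 4")
    rows = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record at line %d" % (i + 1))
        rows.append("\t".join((header[1:], seq, qual)))
    return rows


def tabular_to_fastq(rows):
    """Expand each row back into the four lines of a FASTQ record."""
    out = []
    for row in rows:
        # rsplit keeps any tabs inside the read name intact
        name, seq, qual = row.rsplit("\t", 2)
        out.extend(("@" + name, seq, "+", qual))
    return out


def subsample(lines, n, seed=None):
    """Steps 1-3: to tabular, pick n rows at random, back to FASTQ."""
    rows = fastq_to_tabular(lines)
    picked = random.Random(seed).sample(rows, n)
    return tabular_to_fastq(picked)
```

Because each row carries the sequence and its quality string together, any line-based random selection keeps them paired.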
Hi Paul, Hi Peter You might also wanna look at the 'FastqSampler' function in the Bioconductor 'ShortRead' package http://bioconductor.org/packages/release/bioc/html/ShortRead.html We are working (as part of our NGS pipeline redesign) on adding more Bioconductor functionalities to Galaxy. Unfortunately, it is very low on my pile of stuff to do, so it will take a while till it appears in the 'Tool Shed'. Regards, Hans
ADD REPLYlink written 6.9 years ago by Hotz, Hans-Rudolf1.8k
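For comparison, a single-pass sampler can be sketched in Python with reservoir sampling (Algorithm R). This is a swapped-in illustration of streaming sampling in general, not a description of how FastqSampler works internally, and the names `read_records` and `reservoir_sample` are invented for this example. It assumes four lines per record.

```python
import random
from itertools import islice


def read_records(handle):
    """Yield FASTQ records as lists of 4 lines from an open text handle."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record at end of input")
        yield record


def reservoir_sample(records, k, seed=None):
    """Return k records chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep record i with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir
```

Only the k kept records are held in memory, so the total number of reads does not need to be known in advance. The returned records are not in file order, and if the input has fewer than k records, all of them are returned.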
Hi, This may be a bit dumb or missing the point but just selecting the first 5 million is kind of random isn't it? I mean where the reads map and what they are from is not known to you and they were not collected by the sequencer in a manner that is influenced by the nature of the sample? Best Wishes, David. __________________________________ Dr David A. Matthews Senior Lecturer in Virology Room E49 Department of Cellular and Molecular Medicine, School of Medical Sciences University Walk, University of Bristol Bristol. BS8 1TD U.K. Tel. +44 117 3312058 Fax. +44 117 3312091 D.A.Matthews@bristol.ac.uk
ADD REPLYlink written 6.9 years ago by David Matthews630
David, in my experience with Illumina sequencing, it looks like the reads at the start of a file have a much higher sequencing error rate. Bob H
ADD REPLYlink written 6.9 years ago by Bob Harris190
Yes, reads at the start and the end of the file come from the edge of the Illumina slide, and tend to be of poorer quality than the reads from the middle. So depending on the purpose in mind, picking 5 million reads from the middle of the file might be fine (and much easier computationally). Peter
ADD REPLYlink written 6.9 years ago by Peter Cock1.4k
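Taking a contiguous block from the middle of the file, as suggested above, needs nothing more than line counting. A minimal sketch, assuming four lines per record; `middle_reads`, `skip` and `count` are names chosen for this example:

```python
from itertools import islice


def middle_reads(in_handle, out_handle, skip, count):
    """Copy `count` records to out_handle after skipping `skip` records."""
    start = skip * 4
    stop = start + count * 4
    written = 0
    # islice discards the first `start` lines and stops at line `stop`.
    for line in islice(in_handle, start, stop):
        out_handle.write(line)
        written += 1
    return written // 4  # number of complete records copied
```

For the numbers in this thread, `skip=22_500_000` and `count=5_000_000` would take 5 million reads centred in a 50 million read file. This is a contiguous block, not a random sample.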
To the best of my knowledge, reads at the start of a SOLiD data set also have a higher error rate. I think it might also be due to an edge effect.
ADD REPLYlink written 6.9 years ago by Kevin Lam50