Selecting Reads At Random From Fastq File


7.1 years ago by

Austin Paul • 140 wrote:

Hi, I am curious whether anyone knows how to select random reads from a FASTQ file. There is a select random lines tool in the text manipulation tools, but it does not handle FASTQ files specifically, so it will not keep quality lines grouped with their sequence lines. And if I convert the FASTQ file to tabular form in order to select lines, I can no longer convert it back to FASTQ form. Does anyone know a way to do this in Galaxy? Otherwise, perhaps with another program? Thanks. Austin

galaxy • 2.5k views


modified 7.1 years ago by Jennifer Hillman Jackson ♦ 25k • written 7.1 years ago by Austin Paul • 140

7.1 years ago by

Peter Cock • 1.4k (European Union) wrote:

How big are your FASTQ files (can they be indexed in memory)? And are you willing to program? If you like Python, Biopython's Bio.SeqIO.index(...) or Bio.SeqIO.index_db(...) functions would let you do this easily. Have a look at the "Getting the raw data for a record" example in the tutorial, and please ask if you would like a little more help:

http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Regards, Peter
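The indexing idea above (record where each read starts, then fetch the raw text of the chosen reads) can be sketched with the Python standard library alone. This is not Biopython's implementation; Bio.SeqIO.index(...) with its get_raw() method does the same job more robustly. The function names below are hypothetical, and the sketch assumes a strict 4-lines-per-record FASTQ file with no line wrapping. Holding tens of millions of offsets in a Python list needs a lot of memory, which is the limitation raised in the question about file size.

```python
import random


def build_offset_index(path):
    """Return the byte offset of each record (assumes strict 4-line FASTQ)."""
    offsets = []
    with open(path, "rb") as handle:
        while True:
            start = handle.tell()
            lines = [handle.readline() for _ in range(4)]
            if not lines[0].strip():
                break  # end of file, or a trailing blank line
            if not lines[0].startswith(b"@") or not lines[3]:
                raise ValueError("malformed or truncated record at byte %d" % start)
            offsets.append(start)
    return offsets


def read_raw_record(handle, offset):
    """Return the raw bytes of the 4-line record starting at offset."""
    handle.seek(offset)
    return b"".join(handle.readline() for _ in range(4))


def sample_records(in_path, out_path, k, seed=None):
    """Write k randomly chosen records (without replacement) to out_path."""
    offsets = build_offset_index(in_path)
    rng = random.Random(seed)
    # Sorting the chosen offsets keeps the original file order in the output.
    chosen = sorted(rng.sample(offsets, k))
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for offset in chosen:
            dst.write(read_raw_record(src, offset))
    return len(chosen)
```

For example, sample_records("reads.fastq", "subset.fastq", 5000000) would write five million reads, where both file names are placeholders.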

written 7.1 years ago by Peter Cock • 1.4k

Hi Peter, Thanks for the suggestion. For example, I have a fastq file with 50 million reads and I want to randomly select 5 million of them. It seems biopython would very easily select a single or a handful of reads with the Bio.SeqIO.index() function. Would it also be able to do the job I am interested in? Austin

written 7.1 years ago by Austin Paul • 140

I think so, but you'd have to use Bio.SeqIO.index_db(), which stores the index in an SQLite database on disk rather than in memory. An in-memory index isn't really viable here (unless you have a 64-bit, big-memory machine?). I don't think I've tried it with quite that many reads, though...

Alternatively, if I understood her correctly, Jennifer pointed out that you can do this in Galaxy, but it will take a lot of IO:

1. Convert FASTQ to tabular (4 lines per record -> 1 line per record)
2. Randomly select lines (each line is now a record, so this is safe)
3. Convert tabular back to FASTQ

It should work though, and requires no additional programming.

Peter
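Outside Galaxy, the same three steps can be sketched in plain Python. This is only an illustration of the round trip, not what the Galaxy tools do internally. The function names are hypothetical, the tabular layout (name, sequence, quality) is an assumption, and the sketch assumes 4-line records whose third line is a bare "+". It holds every row in memory, so it suits small files rather than 50 million reads.

```python
import random


def fastq_to_tabular(fastq_lines):
    """Step 1: collapse each 4-line record into one tab-separated row."""
    rows = []
    it = iter(fastq_lines)
    # zip over the same iterator four times yields one record per step;
    # a trailing partial record is silently dropped.
    for header, seq, _plus, qual in zip(it, it, it, it):
        name = header.rstrip("\r\n")[1:]  # drop the leading "@"
        rows.append("\t".join([name, seq.rstrip("\r\n"), qual.rstrip("\r\n")]))
    return rows


def select_random_lines(rows, k, seed=None):
    """Step 2: pick k rows without replacement, keeping the original order."""
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(rows)), k))
    return [rows[i] for i in picked]


def tabular_to_fastq(rows):
    """Step 3: expand each row back into a 4-line FASTQ record."""
    for row in rows:
        # rsplit from the right, in case the read name itself contains a tab.
        name, seq, qual = row.rsplit("\t", 2)
        yield "@%s\n%s\n+\n%s\n" % (name, seq, qual)
```

Because each row carries the sequence and its quality string together, random selection at step 2 cannot separate them, which is the point of the tabular detour.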

written 7.1 years ago by Peter Cock • 1.4k

Hi Paul, Hi Peter

You might also want to look at the 'FastqSampler' function in the Bioconductor 'ShortRead' package:

http://bioconductor.org/packages/release/bioc/html/ShortRead.html

We are working (as part of our NGS pipeline redesign) on adding more Bioconductor functionality to Galaxy. Unfortunately, it is very low on my pile of stuff to do, so it will take a while until it appears in the 'Tool Shed'.

Regards, Hans

written 7.1 years ago by Hotz, Hans-Rudolf • 1.8k

Hi, This may be a bit dumb or missing the point, but isn't just selecting the first 5 million kind of random? I mean, where the reads map and what they are from is not known to you, and they were not collected by the sequencer in a manner that is influenced by the nature of the sample. Best Wishes, David.

Dr David A. Matthews, Senior Lecturer in Virology, Room E49, Department of Cellular and Molecular Medicine, School of Medical Sciences, University Walk, University of Bristol, Bristol. BS8 1TD U.K. Tel. +44 117 3312058 Fax. +44 117 3312091 D.A.Matthews@bristol.ac.uk

written 7.1 years ago by David Matthews • 630

David, in my experience with Illumina sequencing, it looks like the reads at the start of a file have a much higher sequencing error rate. Bob H

written 7.1 years ago by Bob Harris • 190

Yes, reads at the start and the end of the file come from the edge of the Illumina slide, and tend to be of poorer quality than the reads from the middle. So depending on the purpose in mind, picking 5 million reads from the middle of the file might be fine (and much easier computationally). Peter
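Taking a contiguous block from the middle of the file needs no index at all, as this minimal sketch shows. It assumes strict 4-line FASTQ records and that the total number of records is already known (for instance, the line count divided by four). The function name and parameters are hypothetical, and the result is a contiguous block, not a random sample.

```python
from itertools import islice


def take_middle_records(in_path, out_path, total_records, wanted):
    """Copy `wanted` consecutive records centred in the file.

    Returns the number of records skipped at the start.
    """
    if wanted > total_records:
        raise ValueError("cannot take more records than the file contains")
    skip = (total_records - wanted) // 2
    start_line = skip * 4             # 4 lines per record
    stop_line = start_line + wanted * 4
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        # islice streams through the file, so memory use stays constant.
        dst.writelines(islice(src, start_line, stop_line))
    return skip
```

With 50 million reads and 5 million wanted, this skips the first 22.5 million records and copies the next 5 million.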

written 7.1 years ago by Peter Cock • 1.4k

To the best of my knowledge, reads at the start of a SOLiD data set also have a higher error rate. I think it might also be due to an edge effect.

written 7.1 years ago by Kevin Lam • 50
