Question: How to sample a subset from a fastq file?
4 months ago, xiang-jiao.yang wrote:

Is it possible to use Galaxy to sample a subset from a fastq file? For example, to get 10 million reads from a 50-million-read file.

4 months ago, Jennifer Hillman Jackson wrote:

Hello,

There are a few options; these are the top two for fastq input:

1) To select lines from the top or bottom of a dataset, see the tools "select first" and "select last" in the group Text Manipulation. Fastq data will be accepted as a tabular input datatype; just make sure to select lines in multiples of 4. Fastq format has four lines for each read, so "10 million reads" would be "40 million lines".

2) To randomly select lines (fastq entries), try these tools, in this order. If they produce the output you want, the tools can be placed into a workflow for later reuse, in effect creating your own custom tool.

  • Convert Fastq to Tabular
  • Select random lines - or optionally, some functions of the tool Datamash
  • Convert Tabular to Fastq

You may need to reassign the datatype (fastq/fastqsanger) after either approach.
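For anyone who wants to see the logic of the two options outside Galaxy, here is a minimal Python sketch. It is not what the Galaxy tools run internally. It assumes a plain four-line-per-read fastq file (no wrapped sequence lines), and the function names `read_records`, `first_reads`, and `sample_reads` are made up for illustration. The random option uses reservoir sampling, swapped in here because it needs only one pass over the file.

```python
import itertools
import random

LINES_PER_READ = 4  # header, sequence, '+', quality (assumes unwrapped fastq)


def read_records(lines):
    """Group an iterable of lines into 4-line fastq records."""
    it = iter(lines)
    while True:
        record = list(itertools.islice(it, LINES_PER_READ))
        if not record:
            return
        if len(record) != LINES_PER_READ:
            raise ValueError("line count is not a multiple of 4")
        yield record


def first_reads(lines, n_reads):
    """Option 1: keep the first n_reads records, i.e. n_reads * 4 lines."""
    return list(itertools.islice(read_records(lines), n_reads))


def sample_reads(lines, n_reads, seed=None):
    """Option 2: uniform random sample of n_reads records (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(read_records(lines)):
        if i < n_reads:
            # Fill the reservoir with the first n_reads records.
            reservoir.append(record)
        else:
            # Keep record i with probability n_reads / (i + 1).
            j = rng.randint(0, i)
            if j < n_reads:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    # Tiny in-memory example: 5 reads, 20 lines.
    demo = []
    for k in range(5):
        demo += [f"@read{k}\n", "ACGT\n", "+\n", "IIII\n"]
    print(len(first_reads(demo, 2)) * LINES_PER_READ, "lines kept by option 1")
    print(len(sample_reads(demo, 3, seed=0)), "reads kept by option 2")
```

To run this on a real file, pass an open file handle as `lines`. Note that `sample_reads` holds every selected read in memory and does not preserve the original read order, so a 10 million read sample needs a correspondingly large amount of RAM.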

There are other tools that sample from BAM and VCF datasets (not Fastq directly). These could be an option if the data is already mapped or is further along in downstream analysis. Search with the term "sample" in the tool panel at http://usegalaxy.org to review the choices.

Thanks, Jen, Galaxy team


Thanks. It is very helpful.

Reply written 4 months ago by xiang-jiao.yang