Question: How to sample a subset from a fastq file?
16 months ago, xiang-jiao.yang wrote:

Is it possible to use Galaxy to sample a subset from a fastq file? For example, to get 10 million reads from a 50-million-read file.

chip-seq • 1.2k views
16 months ago, Jennifer Hillman Jackson (United States) wrote:

Hello,

There are a few options. These are the top two for fastq input:

1) To select lines from the top or bottom of a dataset, see the tools "Select first" and "Select last" in the Text Manipulation group. Fastq data will be accepted as a tabular input datatype; just make sure the number of lines you select is a multiple of 4. Fastq format has four lines for each read, so 10 million reads would be 40 million lines.
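The line arithmetic above can be sketched outside Galaxy in a few lines of Python. This is only an illustration, not what the Galaxy tools run internally: the names `lines_for_reads` and `first_reads` are made up for this example, and it assumes the common layout of exactly four lines per read (no wrapped sequence lines).

```python
from itertools import islice

LINES_PER_READ = 4  # header, sequence, "+" separator, quality string


def lines_for_reads(n_reads):
    """Number of fastq lines that hold n_reads complete records."""
    return n_reads * LINES_PER_READ


def first_reads(lines, n_reads):
    """Return the first n_reads records as a flat list of lines.

    `lines` can be any iterable of lines, such as an open file handle.
    Assumes exactly four lines per record (hypothetical helper).
    """
    return list(islice(lines, lines_for_reads(n_reads)))
```

With an open file this would be called as `first_reads(handle, 10_000_000)`, which reads the same 40 million lines that "Select first" would be asked for.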

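2) To randomly select whole fastq entries, try these tools, in this order. Converting to tabular first puts each read on a single line, so selecting random lines selects complete reads. If the result is the output you want, the tools could be placed into a workflow for later reuse, in effect creating your own custom tool.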

  • Convert Fastq to Tabular
  • Select random lines - or optionally, some functions of the tool Datamash
  • Convert Tabular to Fastq

You may need to reassign the datatype (fastq/fastqsanger) after either approach.
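For readers who want to see the logic of the three steps in option 2, here is a rough Python sketch. It is an illustration under assumptions, not the code behind the Galaxy tools: the function names are invented for this example, records are assumed to be exactly four lines each, and keeping the sampled reads in their original file order is a choice made here that the Galaxy tool may not share.

```python
import random


def fastq_to_rows(lines):
    """Group every four fastq lines into one (name, sequence, quality) row."""
    lines = [ln.rstrip("\n") for ln in lines]
    if len(lines) % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    rows = []
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        rows.append((header[1:], seq, qual))  # drop the leading "@"
    return rows


def sample_rows(rows, n, seed=None):
    """Pick n rows at random without replacement, keeping input order."""
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(rows)), n))
    return [rows[i] for i in picked]


def rows_to_fastq(rows):
    """Turn (name, sequence, quality) rows back into fastq lines."""
    out = []
    for name, seq, qual in rows:
        out.extend(["@" + name, seq, "+", qual])
    return out
```

Chaining the three calls, `rows_to_fastq(sample_rows(fastq_to_rows(lines), n))`, mirrors the tool order in the list. This version holds every read in memory, so it suits small test files better than a 50-million-read dataset.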

There are other tools that sample from BAM and VCF datasets (not from Fastq directly). These could be an option if the data is already mapped or is further along in downstream analysis. Search with the term "sample" in the tool panel at http://usegalaxy.org to review the choices.

Thanks, Jen, Galaxy team


Thanks. It is very helpful.

Reply written 16 months ago by xiang-jiao.yang