Question: How to sample a subset from a fastq file?
12 months ago, xiang-jiao.yang wrote:

Is it possible to use Galaxy to sample a subset from a fastq file? For example, to get 10 million reads from a 50 million read file.

chip-seq • 802 views
12 months ago, Jennifer Hillman Jackson (United States) wrote:

Hello,

There are a few options. These are the top two for fastq input:

1) To select lines from the top or bottom of a dataset, see the tools "select first" and "select last" in the group Text Manipulation. Fastq data will be accepted as a tabular input datatype; just make sure to select lines in multiples of 4. Fastq format has four lines for each read, so 10 million reads would be 40 million lines.

2) To randomly select fastq entries, try these tools, in this order. If they produce the output you want, the tools can be placed into a workflow for later reuse, in effect creating your own custom tool.

  • Convert Fastq to Tabular
  • Select random lines - or optionally, some functions of the tool Datamash
  • Convert Tabular to Fastq

You may need to reassign the datatype (fastq/fastqsanger) after either conversion step.
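For anyone who wants to see what the two options do under the hood, here is a minimal Python sketch. It is not how the Galaxy tools are implemented; the function names are made up for illustration, and it assumes plain four-line fastq records (no wrapped sequences). The random step uses reservoir sampling, which is one way to pick rows uniformly in a single pass.

```python
import random
from itertools import islice

LINES_PER_READ = 4  # assumed layout: header, sequence, '+', quality


def head_reads(lines, n_reads):
    """Option 1: keep the first n_reads records (n_reads * 4 lines)."""
    return list(islice(lines, n_reads * LINES_PER_READ))


def fastq_to_rows(lines):
    """Option 2, step 1: fold each 4-line record into one row (a tuple)."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, LINES_PER_READ))
        if not record:
            return
        if len(record) != LINES_PER_READ:
            raise ValueError("line count is not a multiple of 4")
        yield record


def sample_rows(rows, k, seed=None):
    """Option 2, step 2: reservoir-sample k rows uniformly in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            # Replace a kept row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


def rows_to_fastq_lines(rows):
    """Option 2, step 3: unfold rows back into fastq lines."""
    return [field + "\n" for row in rows for field in row]
```

Given an open file handle `fh`, `head_reads(fh, 10_000_000)` reads 40 million lines, and `rows_to_fastq_lines(sample_rows(fastq_to_rows(fh), 10_000_000))` gives a random subset. Note that the random version keeps all sampled rows in memory and does not preserve the original read order, and that option 1 is not a random sample.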

There are other tools to sample from BAM and VCF datasets (not Fastq directly). These could be an option if the data is already mapped or is further along in downstream analysis. Search with the term "sample" in the tool panel at http://usegalaxy.org to review the choices.

Thanks, Jen, Galaxy team


Thanks. It is very helpful.

Reply written 12 months ago by xiang-jiao.yang