How to sample a subset from a fastq file?


Question: How to sample a subset from a fastq file?


xiang-jiao.yang • 10 wrote:

Is it possible to use Galaxy to sample a subset from a fastq file? For example, to get 10 million reads from a 50 million read file.

chip-seq • 1.2k views


modified 16 months ago by Jennifer Hillman Jackson ♦ 25k • written 16 months ago by xiang-jiao.yang • 10


Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

There are a few options; these are the top two for fastq input:

1) To select lines from the top or bottom of a dataset, see the tools "select first" and "select last" in the group Text Manipulation. Fastq data will be accepted as a tabular input datatype; just make sure to select lines in multiples of 4. Fastq format has four lines for each read, so 10 million reads would be 40 million lines.
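For readers working outside Galaxy, the line arithmetic in option 1 can be sketched in plain Python. This is only an illustration, not how the Galaxy tools are implemented. The function name `head_reads` and the sample reads are made up for the example, and the sketch assumes the common four-line fastq layout with no wrapped sequence lines.

```python
from itertools import islice

LINES_PER_READ = 4  # header, sequence, '+' separator, quality string


def head_reads(lines, n_reads):
    """Return the lines of the first n_reads fastq records.

    Assumes every record occupies exactly four lines.
    """
    return list(islice(lines, n_reads * LINES_PER_READ))


# Tiny made-up dataset: three reads, keep the first two.
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "IIII",
    "@read3", "GGCC", "+", "IIII",
]
subset = head_reads(example, 2)
print(len(subset))                  # 8 lines, i.e. 2 reads
print(10_000_000 * LINES_PER_READ)  # 40000000 lines for 10 million reads
```

Because `islice` works on any iterator, the same function can be given an open file handle instead of a list, so the whole file does not need to be read.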

2) To randomly select fastq entries, try these tools, in this order. If they produce the output you want, the tools can be placed into a workflow for later reuse, in effect creating your own custom tool.

Convert Fastq to Tabular
Select random lines - or optionally, some functions of the tool Datamash
Convert Tabular to Fastq

You may need to reassign the datatype (fastq/fastqsanger) after either option.
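The logic behind option 2 (convert to one row per read, pick random rows, convert back) can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the code of the Galaxy tools: the function names and the sample reads are hypothetical, records are assumed to be exactly four lines each, and no field is assumed to contain a tab character.

```python
import random


def fastq_to_tabular(lines):
    """Collapse each 4-line fastq record into one tab-separated row."""
    lines = [line.rstrip("\n") for line in lines]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of 4 lines")
    return ["\t".join(lines[i:i + 4]) for i in range(0, len(lines), 4)]


def select_random_rows(rows, n, seed=None):
    """Pick n distinct rows at random; a seed makes the choice repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


def tabular_to_fastq(rows):
    """Expand each tab-separated row back into four fastq lines."""
    out = []
    for row in rows:
        out.extend(row.split("\t"))
    return out


# Tiny made-up dataset: three reads, sample two of them.
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "IIII",
    "@read3", "GGCC", "+", "IIII",
]
rows = fastq_to_tabular(example)
sampled = tabular_to_fastq(select_random_rows(rows, 2, seed=0))
print(len(sampled))  # 8 lines, i.e. 2 complete reads
```

Sampling whole rows rather than raw lines is what keeps each read's header, sequence, and quality string together. Note that this sketch holds all rows in memory, which may not be practical for a 50 million read file.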

There are other tools to sample from BAM and VCF datasets (not Fastq directly). These could be an option if the data is already mapped or in other downstream analysis. Search with the term "sample" in the tool panel at http://usegalaxy.org to review the choices.

Thanks, Jen, Galaxy team

modified 16 months ago • written 16 months ago by Jennifer Hillman Jackson ♦ 25k

Thanks. It is very helpful.

written 16 months ago by xiang-jiao.yang • 10
