Multiple FASTQ files per replicate

3.5 years ago by

United States

nashedm • 10 wrote:

Hi There,

I'm running an RNA-seq analysis using FASTQ data from an Illumina HiSeq Rapid V2 machine (single reads at 50 bp). I don't have experience with UNIX coding, so I am using Galaxy, and specifically the Tuxedo applications, to align/map my reads and then perform differential analysis (probably with Cuffdiff).

For my data, I have 3 conditions with 6 biological replicates in each. In addition, each biological replicate was run on 2 lanes, so I have 2 technical replicates per sample. On top of that, the data for each technical replicate arrived as 2 separate FASTQ files. The technician mentioned that the machine automatically creates a new file when the current one reaches about 200 MB or so.

So I'm wondering about the proper method of combining these files. First, how do I combine the 2 FASTQ files for each technical replicate? I imagine this has to be done early, prior to mapping with Tophat? Second, at which step should I combine the technical replicates? At the differential expression analysis step I should only be comparing biological replicates; including separate technical replicates would be pseudoreplication. So I imagine combining my technical replicates happens prior to the Cuffdiff step, but I'm not sure whether it happens before or after mapping.

Any help would be greatly appreciated!

rna-seq tophat alignment galaxy • 2.8k views

modified 3.5 years ago by Jennifer Hillman Jackson ♦ 25k • written 3.5 years ago by nashedm • 10

3.5 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Merge multiple FASTQ files representing a single sample by using the tool Concatenate datasets tail-to-head. It is okay to do QC first (in order to narrow down where lab issues may have occurred), but then merge before doing anything else. Are these paired-end? If so, merge the forward datasets together, then merge the reverse datasets together. Run each sample's pair through a mapping tool like Tophat.
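For anyone doing the same merge outside Galaxy, here is a minimal Python sketch of the idea, not a description of how the Galaxy tool is implemented. It assumes uncompressed FASTQ files with the usual four lines per read; the function names `concatenate_fastq` and `count_reads` are made up for illustration.

```python
import os
import shutil


def concatenate_fastq(parts, merged):
    """Append each input FASTQ file to `merged`, in the order given."""
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
                # If a piece lacks a final newline, add one so its last
                # quality line does not fuse with the next read's header.
                if src.tell() > 0:
                    src.seek(-1, os.SEEK_END)
                    if src.read(1) != b"\n":
                        out.write(b"\n")


def count_reads(fastq):
    """Count reads, assuming exactly four lines per FASTQ record."""
    with open(fastq, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```

A quick sanity check after merging is that `count_reads` on the merged file equals the sum of `count_reads` over the pieces. For paired-end data, the forward pieces and the reverse pieces would each be merged separately, with the pieces listed in the same order for both.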

More about RNA-seq is in the Galaxy wiki and many other places, including the home page for the Tuxedo pipeline.

Best, Jen, Galaxy team

written 3.5 years ago by Jennifer Hillman Jackson ♦ 25k

Thanks Jen,

These are single-end sequences.

I clarified with the technician, and she would not treat sequences from 2 lanes for one sample as technical replicates (because data was not collected twice from the same prep). Rather, she said to treat the data from the 2 lanes as subsets of the same sequence. So they should be combined cumulatively, such that if each lane produced 5 million reads, the combined sequence would be 10 million reads. What tool would be appropriate for this type of merge? Would it be the "concatenate two datasets into one dataset" option under "operate on genomic interval"?

Thanks Again.

written 3.5 years ago by nashedm • 10
