I'm running an RNA-seq analysis to look for differentially expressed genes. I'm using single-end 50 bp reads from an Illumina HiSeq Rapid V2 machine.
I have 3 conditions with 6 biological replicates in each. Also, for each biological replicate, I have 2 FASTQ files, one from each lane. I guess these are akin to technical replicates, except that each lane has unique reads that are cumulative. This is from the technician that ran the sequencing, for clarification:
"We loaded a single tube of pooled libraries on the HiSeq, which then deposits equal volumes on each lane, and the library fragments hybridize randomly across the surface. The two lanes are more like subsets of the whole dataset than replicates of each other."
So, if the FASTQ file from lane 1 contains 5 million reads and the file from lane 2 contains another 5 million reads, the idea is that the combined file should contain 10 million reads in total.
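To make concrete what I mean by combining at the FASTQ level, here is a minimal Python sketch. The function names are hypothetical and only for illustration, and it assumes uncompressed FASTQ files with 4 lines per read, each ending in a newline. I'm not claiming this is the right step at which to combine; it only shows the read counts adding up.

```python
import shutil


def count_reads(fastq_path):
    """Count reads in an uncompressed FASTQ file (assumes 4 lines per read)."""
    with open(fastq_path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{fastq_path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def concatenate_lanes(lane_paths, combined_path):
    """Append the per-lane FASTQ files, in order, into one combined file.

    Assumption: every lane file ends with a newline, so records from
    consecutive files do not run together. Returns the combined read count.
    """
    with open(combined_path, "wb") as out:
        for lane_path in lane_paths:
            with open(lane_path, "rb") as lane:
                shutil.copyfileobj(lane, out)
    return count_reads(combined_path)
```

For one sample this would be called as something like `concatenate_lanes(["sampleA_L001.fastq", "sampleA_L002.fastq"], "sampleA_combined.fastq")` (made-up file names), and the returned count should equal the sum of `count_reads` over the two lane files.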
I'm wondering what the appropriate way to combine these data is, and at what step I should do it. From what I understand, I should do quality control on the separate lane files first. But after that, I'm not sure whether I should combine the FASTQ files (somehow) and then align with TopHat, etc., or whether I should align each lane separately and create my BAM files first, then combine the data from each lane for each sample (using SAMtools > Merge BAM files?) before moving on to Cufflinks.
Any help would be appreciated.