Pre-Processing Of Illumina RNA-Seq Paired-End Data

6.8 years ago by

Ravi Karra • 10 wrote:

Hello, I have Illumina 76 bp paired-end data for a zebrafish RNA-seq experiment and am stuck trying to pre-process my data prior to using TopHat/Cuffdiff. For each sample, I have a read1 FASTQ file and a paired read2 FASTQ file. After using FASTQ Groomer, I trimmed the ends using FASTQ Quality Trimmer with a threshold quality score of 20 and a window size of 1 (I think that will essentially lop off the end of the read until the quality score is >= 20). Next, I trimmed the adapters using Clip. What I am left with is a modified read1 FASTQ file and a modified read2 FASTQ file, where the pairs are not in the same order and some reads are left without pairs. From what I have read, I don't think TopHat can handle paired-end data that is out of order. I tried to get around the ordering issue using FASTQ joiner, but this tool was not able to join the reads (it returned 0 joined reads). I am not really sure why FASTQ joiner didn't work for me and am looking for suggestions of what to try next. Thanks! Ravi
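To make the trimming step described above concrete, here is a minimal Python sketch of 3' end trimming with a window size of 1 and a threshold of 20. It is only an illustration of the behaviour the poster describes, not the Galaxy FASTQ Quality Trimmer's actual code. The function name `trim_3prime` is made up for this example, and the Phred+33 offset is an assumption (it matches Sanger-format FASTQ).

```python
def trim_3prime(seq, qual, threshold=20, offset=33):
    """Trim low-quality bases from the 3' end of a single read.

    Window size 1: every base is judged on its own quality score.
    offset=33 assumes Sanger (Phred+33) encoded qualities.
    """
    end = len(qual)
    # Step back from the 3' end while the last remaining base is below threshold.
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2, '!' encodes Q0.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII5#!")
print(trimmed_seq, trimmed_qual)
```

Because read1 and read2 are trimmed independently, a read can end up empty or too short and be dropped from one file only, which is one way the two files can fall out of step.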

rna-seq cuffdiff • 4.0k views

modified 6.8 years ago by Victor Ruotti • 90 • written 6.8 years ago by Ravi Karra • 10

6.8 years ago by

Sameet Mehta • 10 wrote:

Hi, I think you need to first remove the adaptors and then trim the reads. That is probably the correct order. As for the second part of the question, you could try a rudimentary approach: search for each sequence header from one file in the other file. I have seen different sizes in the r1 and r2 read files, but taken together almost 90% of the reads turn out to be true paired reads. Hope this helps, Sameet
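The header-matching idea can be sketched in Python roughly as follows. This is a hedged illustration rather than an existing Galaxy tool: the function names (`read_fastq`, `read_id`, `repair`) are invented for the example, records are assumed to be well-formed four-line FASTQ, and all of read2 is held in memory, which may not be practical for very large files.

```python
def read_fastq(lines):
    """Yield (header, seq, plus, qual) tuples from an iterable of FASTQ lines.

    Assumes well-formed four-line records.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip stray blank lines between records
        seq = next(it).rstrip("\n")
        plus = next(it).rstrip("\n")
        qual = next(it).rstrip("\n")
        yield header, seq, plus, qual


def read_id(header):
    """Drop the leading '@', any description, and a trailing /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def repair(r1_lines, r2_lines):
    """Return (pairs, orphans); pairs follow the order of the read1 file."""
    r2 = {read_id(rec[0]): rec for rec in read_fastq(r2_lines)}
    pairs, orphans = [], []
    for rec in read_fastq(r1_lines):
        mate = r2.pop(read_id(rec[0]), None)
        if mate is None:
            orphans.append(rec)  # read1 record with no mate
        else:
            pairs.append((rec, mate))
    orphans.extend(r2.values())  # read2 records that were never matched
    return pairs, orphans


# Tiny in-memory example: read "a" is paired, "b" and "c" have lost their mates.
r1 = ["@a/1\n", "ACGT\n", "+\n", "IIII\n", "@b/1\n", "GGCC\n", "+\n", "IIII\n"]
r2 = ["@c/2\n", "TTTT\n", "+\n", "IIII\n", "@a/2\n", "TGCA\n", "+\n", "IIII\n"]
pairs, orphans = repair(r1, r2)
print(len(pairs), "pair(s),", len(orphans), "orphan(s)")
```

With real data you would pass open file handles instead of lists and write the paired records to two new files in matching order.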

written 6.8 years ago by Sameet Mehta • 10

6.8 years ago by

SHAUN WEBB • 70 wrote:

Hi Ravi, I got around this problem by using the FASTQ interlacer to join the reads into a single file, then using the deinterlacer to output only the reads that have a pair, in the correct order. You may need to alter the read IDs first by adding /1 and /2 to the end (see the interlacer help text). I used sed on the Unix command line, but I'm sure you can use Galaxy tools to do this. Shaun (in reply to Ravi Karra's message of Wed, 22 Feb 2012)
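For readers who prefer a script to sed, the read ID change could look something like the Python sketch below. It is an assumption-laden illustration: the function name `add_mate_suffix` is invented here, records are assumed to be well-formed four-line FASTQ, and whether your interlacer version actually needs the /1 and /2 suffixes should be checked against its help text.

```python
def add_mate_suffix(lines, mate):
    """Yield FASTQ lines with '/<mate>' appended to each read ID.

    Only every fourth line (the header) is changed, so a quality line
    that happens to start with '@' is left untouched.
    """
    suffix = "/" + str(mate)
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            # Split the read ID from any description after the first whitespace.
            parts = line.split(None, 1)
            if not parts[0].endswith(suffix):
                parts[0] += suffix
            line = " ".join(parts)
        yield line


record = ["@read1 extra\n", "ACGT\n", "+\n", "@III\n"]
for out_line in add_mate_suffix(record, 1):
    print(out_line)
```

You would run this once over the read1 file with `mate=1` and once over the read2 file with `mate=2` before interlacing.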

written 6.8 years ago by SHAUN WEBB • 70

6.8 years ago by

Victor Ruotti • 90 wrote:

Hi, I hope someone can help me with how to implement this in a wrapper. We would like to add an option so the user can set a sample name, which would then be used as the prefix of the output file names. For example, could I specify a sample name "S" and have the output files be "S.gene_abundances", "S.isoform_abundances", "S.rsem_log", and "S.bam"? I know we can name the files from the XML, but is there a way to let the user pass this prefix without having to use a recursive conditional in the XML file to set it? Or is there any other way people are doing this? Thanks in advance. Victor

written 6.8 years ago by Victor Ruotti • 90

This kind of question is normally redirected to the galaxy-dev list. You have no control over the file names at all - Galaxy will assign something like database/files/000/dataset_547.dat automatically. The user never sees the file names anyway. Are you asking about how to control the description/caption shown to the user in Galaxy? Peter

written 6.8 years ago by Peter Cock • 1.4k
