Paired-end Reads trimming

4.5 years ago, araujo.s.leonardo (Germany) wrote:

Hey there. I am starting to analyse some RNAseq paired-end data.

I got stuck at the pre-processing step and have one simple question:

Should I trim my samples for:

>Illumina_Single_End_PCR_Primer_1? (AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT)

>Illumina_Multiplexing_Read2_Sequencing_Primer? (GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT)

>scseqR829in11 (ScriptSeq Libraries)? (CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA)

And also for their respective reverse complements?

Thanks a lot
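
For readers who want to build the reverse-complemented versions of the sequences listed above, here is a minimal sketch in Python. The helper name `reverse_complement` and the `_rc` suffix on the output names are illustrative choices, not part of any Galaxy tool, and whether a given sequence actually needs trimming is still something to check against the QC report.

```python
# Translation table for DNA bases; N is kept as N (illustrative helper).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Sequences copied from the question above.
adapters = {
    "Illumina_Single_End_PCR_Primer_1":
        "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "Illumina_Multiplexing_Read2_Sequencing_Primer":
        "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT",
}

# Print each reverse complement in FASTA style.
for name, seq in adapters.items():
    print(f">{name}_rc")
    print(reverse_complement(seq))
```

The output can be pasted into a clipping tool as additional sequences to search for.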

Have you run FastQC on your dataset? FastQC will tell you whether you need to trim your sequences and whether your data are contaminated.

Ciao,

Bjoern

Reply written 4.5 years ago by Bjoern Gruening

4.5 years ago, Jennifer Hillman Jackson (United States) wrote:

Hi Araujo,

For QA steps, definitely run the tool "FastQC" first on uploaded datasets, as Bjoern suggests. This is good for a few reasons:

This will help you make sure that the quality score scaling is correct (and adjust it if needed) and decide how to trim. This tool and others used to prepare data are in the tool group "NGS: QC and manipulation"; alternatively, use the tool search at the top of the left tool panel if you know the tool name.
Sometimes you will want to run this tool twice. Run it once on the dataset (or on a sample of it, which can be enough and speeds up processing and uses less space) to detect the quality score scaling, as described here: https://wiki.galaxyproject.org/Support#Dataset_special_cases This is an important first step for any newly uploaded fastq dataset. Then, if you do run the tool "Fastq Groomer" to rescale the quality scores, run "FastQC" again on the entire new dataset for further QC.
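
To illustrate what "detecting the quality score scaling" means, here is a rough sketch in Python that guesses the ASCII offset from a sample of reads. It is not how FastQC or Fastq Groomer work internally; the function name `guess_phred_offset` and the `max_reads` parameter are made up for this example, and it assumes plain 4-line FASTQ records.

```python
from itertools import islice


def guess_phred_offset(fastq_lines, max_reads=10000):
    """Guess the quality offset (33 or 64) from an iterable of FASTQ lines.

    Returns 33, 64, or None when the sampled reads are ambiguous.
    Assumes 4-line records (sequence and quality lines are not wrapped).
    """
    lowest = None
    # Every 4th line, starting at index 3, is a quality string.
    for line in islice(fastq_lines, 3, max_reads * 4, 4):
        qual = line.rstrip("\r\n")
        if not qual:
            continue
        smallest = min(map(ord, qual))
        lowest = smallest if lowest is None else min(lowest, smallest)
    if lowest is None:
        return None
    if lowest < 59:
        # Characters below ';' (ASCII 59) only occur with offset 33.
        return 33
    if lowest >= 64:
        # Nothing below '@' (ASCII 64): most likely offset 64.
        return 64
    # 59..63 could be Solexa-style offset 64 or high-quality offset 33.
    return None


sample = ["@read1\n", "ACGT\n", "+\n", "II#5\n"]
print(guess_phred_offset(sample))
```

A result of 64 (or None) on a sample is a hint that the dataset should be rescaled before further QC; the FastQC report remains the reference for the final decision.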

You may be aware of this, but it seems worth mentioning while on the subject of QA/QC. How much you want to do in terms of trimming or filtering on quality or on mapping result status will depend on the type of downstream analysis you plan to do.

If proceeding with an expression analysis workflow (Tophat, Cufflinks, etc.), then the less you alter the data beyond basic artifact removal, the better it often is, as you'll map more data and avoid skewing results. The Tuxedo pipeline on usegalaxy.org under the tool group "NGS: RNA-Seq" will perform filtering for you (some is built in, the rest is available as tool options on the forms).
But if you plan on performing a workflow that involves variant calling, common choices before doing the calling are a bit more QC at the beginning to keep only the highest quality sequence (more aggressive quality trimming and potentially filtering out low-quality sequences) and, later, filtering for properly mapped paired ends. This is in addition to setting tool form options to screen for statistically significant variants; note that some tools are more sensitive than others by default.
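
As a concrete picture of what "more aggressive quality trimming" changes, here is a deliberately simple sketch in Python that removes low-quality bases from the 3' end of a read. It is a plain trailing trim, not the algorithm of any particular Galaxy tool; the name `trim_3prime` and the defaults (`min_q=20`, `offset=33`) are assumptions for illustration.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred quality is below min_q.

    seq and qual are the sequence and quality strings of one read;
    offset is the ASCII offset of the quality encoding.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    # Walk back from the 3' end while the base quality is too low.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Raising min_q removes more of the read (stricter trimming).
print(trim_3prime("ACGTAC", "II55##", min_q=10))
print(trim_3prime("ACGTAC", "II55##", min_q=30))
```

A lower threshold keeps more sequence for mapping in an expression workflow, while a higher one leaves only the most reliable bases, which is the trade-off described above.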

We (Dan and I) have an updated tutorial for these exact workflows in progress, and GCC this year will include a session on similar content during Training Day (Tom Bair from Univ. of Iowa and I). Those resources will be on usegalaxy.org as a Page (with included workflows, datasets, and histories), linked to our Learn & Support wikis, and a short version (linked to the full tutorial) will be placed under 'Tutorials' here on Galaxy Biostar within the next few weeks. For right now, others from the community and our team have related tutorials available, should these be of interest (to you, or to others reading this post). See our Learn wiki resources plus the RNA-seq wiki hub for the links:
https://wiki.galaxyproject.org/Support#Learning_Hub
https://wiki.galaxyproject.org/Support#Tools_on_the_Main_server:_RNA-seq

Thanks! Jen, Galaxy team

4.4 years ago, araujo.s.leonardo (Germany) wrote:

Hello! I am still working with the paired-end reads.

I pre-processed the same reads in two different ways.

First, I joined them, did the Clip and Quality trim steps, then mapped using tophat2.

Second, I clipped and quality-trimmed the reads separately, then mapped using tophat2.

After this I merged both BAM files.

Am I doing something wrong?

Thanks in advance

4.4 years ago, Jennifer Hillman Jackson (United States) wrote:

Hi,

Both of these are close. Here is what you'll want to do:

1. Do the QA/QC steps as individual datasets.

2. Map both in the same run, but as distinct datasets. There is an option for paired-end data input. Select this. The form will reset so that you can enter the forward and reverse reads both in the same form, for the same job run.

3. This will produce a single output BAM dataset.

4. From here you can proceed with downstream RNA-seq analysis.

5. Protocol help is in our hub here; the tutorials will likely be helpful:
http://wiki.galaxyproject.org/Support#Tools_on_the_Main_server:_RNA-seq
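
One thing worth checking between steps 1 and 2: when the forward and reverse files are trimmed or filtered on their own, a read can be dropped from one file but not the other, and paired-end mappers generally expect mates to appear in the same order in both files. The sketch below, in Python, looks for the first record where the read names disagree. The function names `read_id` and `first_mismatch` are hypothetical, and the code assumes plain 4-line FASTQ records with either `/1` and `/2` suffixes or a space-separated mate field in the header.

```python
from itertools import islice, zip_longest


def read_id(header):
    """Return the read name from a FASTQ header line, without mate suffix."""
    parts = header.split()
    name = parts[0] if parts else ""
    if name.startswith("@"):
        name = name[1:]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_mismatch(forward_lines, reverse_lines):
    """Return the 0-based index of the first unpaired record, or None.

    None means both inputs list the same read names in the same order.
    """
    # Header lines are every 4th line, starting with the first.
    fwd_headers = islice(forward_lines, 0, None, 4)
    rev_headers = islice(reverse_lines, 0, None, 4)
    pairs = zip_longest(fwd_headers, rev_headers)
    for index, (fwd, rev) in enumerate(pairs):
        if fwd is None or rev is None or read_id(fwd) != read_id(rev):
            return index
    return None


forward = ["@r1/1\n", "A\n", "+\n", "I\n", "@r2/1\n", "C\n", "+\n", "I\n"]
reverse = ["@r1/2\n", "T\n", "+\n", "I\n", "@r2/2\n", "G\n", "+\n", "I\n"]
print(first_mismatch(forward, reverse))
```

If the check reports a mismatch, the two files need to be brought back into sync before they are entered together on the paired-end mapping form.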

Best, Jen, Galaxy team

Dear Jen, thanks for the reply. So, you are suggesting that I do NOT join the reads, but pre-process them separately?

Reply written 4.4 years ago by araujo.s.leonardo

Correct - there is no need to join, as shown in the protocols.

Thanks! Jen, Galaxy team

Reply written 4.4 years ago by Jennifer Hillman Jackson
