Question: Workflow to clean data prior to variant calling
3.5 years ago by
morgane.moreau.info30 wrote:


I would like some advice on the workflow I'm using.

1- I have WGS data from bacteria. I've uploaded my reference genome and FASTQ files through FTP.

2- FastQC analysis.

3- FASTQ Groomer to get my FASTQ files into fastqillumina format prior to mapping.

4- Mapping to the reference genome with BWA for Illumina.

5- Picard alignment summary metrics (not sure how to interpret all of the output yet, but I'll get there).

Now I think I should get rid of unpaired reads and remove duplicates. Or should I have done that before mapping?

Can I use the "create matched paired end dataset workflow" for that? (I can only select FASTQ Groomer files as RAW files. Shouldn't I use the original FASTQ files?)

Any feedback would be much appreciated.



3.5 years ago by
United States
Jennifer Hillman Jackson24k wrote:


The format you want prior to mapping with BWA is ".fastqsanger". If your data really has quality scores scaled to ".fastqillumina" format, then you will need to run the tool "FASTQ Groomer" as a prep step. This wiki explains how to detect the format and either assign or convert to the proper one. The steps here would precede any in the shared example below.
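To make the format distinction concrete: ".fastqsanger" stores quality scores as ASCII characters with an offset of 33, while ".fastqillumina" uses an offset of 64. The snippet below is only a rough sketch of how that difference could be detected and converted outside Galaxy; it is not how FASTQ Groomer is implemented. The function names and the cutoffs (59 and 74) are assumptions chosen for illustration, and the Solexa-scaled variant is not handled.

```python
def guess_phred_offset(quality_strings):
    """Heuristically guess the ASCII offset (33 or 64) of FASTQ quality strings.

    Returns None when the observed characters fit both encodings.
    The cutoffs are illustrative assumptions, not an official rule.
    """
    codes = [ord(ch) for q in quality_strings for ch in q.strip()]
    if not codes:
        return None
    if min(codes) < 59:
        # Characters this low cannot occur in offset-64 data.
        return 33
    if max(codes) > 74:
        # Characters this high are unusual for offset-33 Illumina data.
        return 64
    return None


def illumina_to_sanger(quality):
    """Convert one offset-64 quality string to offset-33 (shift by 31)."""
    if any(ord(ch) < 64 for ch in quality):
        raise ValueError("quality string is not valid offset-64 data")
    return "".join(chr(ord(ch) - 31) for ch in quality)


if __name__ == "__main__":
    print(guess_phred_offset(["IIII#5"]))    # contains low characters
    print(guess_phred_offset(["hhhhdddd"]))  # contains high characters
    print(illumina_to_sanger("hhhh"))
```

A result of None means more reads need to be inspected before deciding, which is one reason to check the format carefully rather than assume it.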

In short, once the data is in ".fastqsanger" format, you can run FastQC again for the actual QC statistics on your data. Use the results to QC the sequences as needed (it isn't always necessary). Tools to trim, remove adapters, etc. are in the tool group "Fastq manipulation".
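For readers unsure what quality trimming does to a read, here is a minimal sketch of trimming low-quality bases from the 3' end. It assumes ".fastqsanger" data (offset 33) and a cutoff of Q20; both defaults and the function name are illustrative assumptions, and the Galaxy trimming tools may use different (for example, sliding-window) methods.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose quality score is below min_q.

    seq and qual are the sequence and quality lines of one FASTQ record.
    min_q and offset are illustrative defaults, not values from any tool.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    end = len(seq)
    # Walk back from the 3' end until a base passes the cutoff.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # The last two bases have scores Q2 ('#') and Q0 ('!') and are removed.
    print(trim_3prime("ACGTAC", "IIII#!"))
```

Reads that end up very short after trimming are usually discarded as well, which is what can leave unpaired reads behind in paired-end data.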

From there, you can proceed with mapping. In my opinion, it is enough to filter for proper pairs and remove unmapped reads after mapping (you will need to do this anyway), rather than bothering to ensure that only pairs are mapped at the start. But this is your choice. This example protocol has a BWA mapping step followed by a filter step for proper pairs, so you can see how this is done. It also includes adding read groups (useful for many variant tools; required by some, such as those in the GATK group). This data did not require any trimming, but you can add those steps in post-FastQC.
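Separately from the shared protocol, the proper-pair filter can be illustrated on its own. In SAM output, the second column is a bitwise FLAG: bit 0x1 means the read is paired, 0x2 means it mapped in a proper pair, and 0x4 means it is unmapped. The sketch below keeps only mapped reads in proper pairs; the function names are hypothetical, and it assumes plain-text, tab-separated SAM lines rather than BAM.

```python
PAIRED = 0x1       # read is part of a pair
PROPER_PAIR = 0x2  # both mates mapped in the expected orientation/distance
UNMAPPED = 0x4     # the read itself is unmapped


def keep_alignment(flag):
    """Return True for a mapped read that belongs to a proper pair."""
    return (
        bool(flag & PAIRED)
        and bool(flag & PROPER_PAIR)
        and not (flag & UNMAPPED)
    )


def filter_sam_lines(lines):
    """Yield header lines unchanged and only the alignments that pass."""
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.split("\t")
        if keep_alignment(int(fields[1])):
            yield line


if __name__ == "__main__":
    # Hypothetical records, truncated to the first few SAM columns.
    sam = [
        "@HD\tVN:1.0",
        "read1\t99\tchr\t100",  # paired, proper pair, mapped
        "read2\t77\tchr\t0",    # paired, read and mate unmapped
        "read3\t65\tchr\t200",  # paired but not a proper pair
    ]
    for kept in filter_sam_lines(sam):
        print(kept)
```

On the command line, the same selection corresponds to requiring flag 2 and excluding flag 4, for example with `samtools view -f 2 -F 4`.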

Hopefully this helps you build up a protocol for your specific project.

Jen, Galaxy team
