Workflow to clean data prior to variant call

4.2 years ago by

morgane.moreau.info • 30

Australia

morgane.moreau.info • 30 wrote:

Hi,

I would like some advice on the workflow I'm using.

1- I have WGS data from bacteria. I've uploaded my reference genome and fastq files through FTP.

2- FastQC analysis.

3- FastQ Groomer to get my fastq files in fastqillumina format prior to mapping

4- Mapping to ref. genome with BWA for Illumina

5- Picard alignment summary metrics (not sure how to interpret all of the output yet, but I'll get there).

Now I think I should get rid of unpaired reads and remove duplicates. Or should I have done that before mapping?

Can I use the "create matched paired end dataset workflow" for that? (I can only select FastQ groomer files as RAW files. Shouldn't I use the original fastq files?)

Any feedback would be much appreciated.

Morgane

wgs worklfow clean data prior to variant call • 1.4k views


modified 4.2 years ago by Jennifer Hillman Jackson ♦ 25k • written 4.2 years ago by morgane.moreau.info • 30

4.2 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

The format you want to have prior to mapping with BWA is ".fastqsanger". If your data really has quality scores scaled to ".fastqillumina" format, then you will need to run the tool "FASTQ Groomer" as a prep step. This wiki explains how to detect the format and either assign or convert to the proper one. The steps here would precede any in the shared example below.
http://wiki.galaxyproject.org/Support#FASTQ_Datatype_QA
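
To make the format question concrete, here is a minimal Python sketch of the idea behind detecting the quality encoding and rescaling it. It is not how FASTQ Groomer is implemented. The function names are made up for illustration, and the detection is only a heuristic based on the lowest quality character seen (ASCII offset 33 for ".fastqsanger", offset 64 for ".fastqillumina").

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from FASTQ quality strings.

    Returns None when the input is empty or the evidence is ambiguous.
    """
    lowest = min((ord(ch) for q in quality_strings for ch in q), default=None)
    if lowest is None:
        return None
    if lowest < 59:
        # Below the lowest Solexa+64 character, so +64 encodings are ruled out.
        return 33
    if lowest >= 64:
        # Consistent with Phred+64, although uniformly high-quality
        # Sanger data could look the same. Treat this as a hint only.
        return 64
    # 59-63: could be old Solexa+64 or Sanger; not decidable here.
    return None


def illumina_to_sanger(quality):
    """Rescale one Phred+64 quality string to Phred+33."""
    if any(ord(ch) < 64 for ch in quality):
        raise ValueError("quality string is not valid Phred+64")
    # Same Phred score, different ASCII offset: shift by 64 - 33 = 31.
    return "".join(chr(ord(ch) - 31) for ch in quality)
```

In practice you would look at many reads before trusting the guess, and let FASTQ Groomer do the actual conversion.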

In short, once the data is in ".fastqsanger" format, you can run FastQC again to get the actual QC statistics for your data. Use the results to decide whether the sequences need any QC processing (it isn't always necessary). Tools to trim, remove adapters, etc. are in the tool group "Fastq manipulation".
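
As an illustration of what a trimming step does, here is a small Python sketch that removes low-quality bases from the 3' end of a read. The function name and the default threshold of 20 are assumptions chosen for the example. The Galaxy trimming tools offer more options and may use different algorithms.

```python
def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q.

    seq and qual are the sequence and quality lines of one FASTQ record;
    offset=33 assumes ".fastqsanger" encoding.
    """
    keep = len(qual)
    # Walk back from the 3' end until a base meets the threshold.
    while keep > 0 and ord(qual[keep - 1]) - offset < min_q:
        keep -= 1
    return seq[:keep], qual[:keep]
```

For example, a read with qualities "IIII##" (scores 40, 40, 40, 40, 2, 2) would lose its last two bases.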

From there, you can proceed with mapping. In my opinion, it is enough to filter for proper pairs and remove unmapped reads after mapping (you will need to do this anyway), rather than bothering to ensure that only pairs are mapped at the start. But this is your choice. This example protocol has a BWA mapping step followed by a filter step for proper pairs, so you can see how this is done. It also includes adding read groups (useful for many variant tools; required by some, such as those in the GATK group). This data did not require any trimming, but you can add those steps in after FastQC:
http://usegalaxy.org/u/galaxyproject/p/galaxy-101-ngs-variant
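
For reference, the post-mapping filter comes down to checking bits in the FLAG column of each SAM record. The Python sketch below is only meant to show that logic. The function names are hypothetical, and in Galaxy you would use the filter tool from the example protocol instead. The bit values are the ones defined in the SAM specification, and the selection is roughly what `samtools view -f 2 -F 4` does.

```python
PROPER_PAIR = 0x2  # each segment properly aligned according to the aligner
UNMAPPED = 0x4     # segment unmapped


def is_mapped_proper_pair(flag):
    """True if the read is mapped and flagged as part of a proper pair."""
    return bool(flag & PROPER_PAIR) and not (flag & UNMAPPED)


def filter_sam_lines(lines):
    """Yield header lines plus alignments that pass the filter."""
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        # FLAG is the second tab-separated column of an alignment line.
        flag = int(line.split("\t")[1])
        if is_mapped_proper_pair(flag):
            yield line
```

A read with FLAG 99 (paired, proper pair, mate reverse, first in pair) is kept, while FLAG 77 (paired, read and mate unmapped, first in pair) is dropped.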

Hopefully this helps you to build up a protocol for your specific project.

Jen, Galaxy team

