Question: Correct way of merging samples for father, mother, child trio variant calling
17 months ago by
eurioste40 wrote:

I am new to NGS data analysis and I'm working on a multiple-sample variant calling workflow. I have Illumina MiSeq fastq files (paired end, raw reads) for a father, mother and child trio, one pair for each individual, totalling 6 files. I could trim, align, pre-process and call variants for each individual pair separately (I'm skipping indel realignment and quality recalibration for the sake of simplicity, as this workflow is intended for learning only), but I wish to merge the samples into a single file. If possible and correct, I would like the alignment step (with BWA-MEM), the pre-processing steps (with Picard) and the variant calling step (with FreeBayes) to be done at once for all samples, while taking into consideration the correct paired-end mates and the respective read groups (when applicable).

My final goal is to obtain a single vcf file from which I'll compute the total number of different kinds of variants.

At which step, in which file format and with which Galaxy tools can I merge the samples so that I get correct results at the variant calling step?

merge variant calling vcf bam • 781 views
17 months ago by
Guy Reeves1.0k wrote:

I am not too sure that there is a single correct way to do this. Basically, I think the plan you have is fine. Personally, I do not do any preprocessing steps before mapping (the soft-clipping capacity of the mappers removes the need). After the mapping, merge the BAMs (making sure you have any read group issues straightened out). "MergeSamFiles merges multiple SAM/BAM datasets into one" should do; use BAM as input and output.
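To make the read group point concrete, here is a minimal sketch that only builds and prints command lines; it does not run anything. The file names, the `ref.fa` reference and the `*.sorted.bam` names are hypothetical placeholders, and `samtools merge` is used as a command-line stand-in for the Galaxy MergeSamFiles tool. The idea is that each individual gets its own read group ID and sample name (SM) at the mapping step, so the merged BAM can still be split by sample by the variant caller. Sorting and SAM-to-BAM conversion between the two steps are left out.

```python
import shlex

# Hypothetical sample names and FASTQ paths; replace with your own.
SAMPLES = {
    "father": ("father_R1.fastq", "father_R2.fastq"),
    "mother": ("mother_R1.fastq", "mother_R2.fastq"),
    "child": ("child_R1.fastq", "child_R2.fastq"),
}


def read_group(sample, platform="ILLUMINA"):
    """Build an @RG header line using a literal backslash-t between fields,
    which bwa mem converts to real tabs in the output SAM."""
    fields = ["@RG", f"ID:{sample}", f"SM:{sample}", f"LB:{sample}", f"PL:{platform}"]
    return "\\t".join(fields)


def bwa_mem_command(sample, mates, reference="ref.fa"):
    """Argument list for mapping one individual's paired-end reads."""
    r1, r2 = mates
    return ["bwa", "mem", "-R", read_group(sample), reference, r1, r2]


def merge_command(bams, merged="trio.bam"):
    """Argument list for merging the per-sample BAMs into one file."""
    return ["samtools", "merge", merged, *bams]


commands = [bwa_mem_command(name, mates) for name, mates in SAMPLES.items()]
merge = merge_command([f"{name}.sorted.bam" for name in SAMPLES])

for cmd in commands + [merge]:
    print(shlex.join(cmd))
```

If two individuals ended up with the same read group ID or SM value, the merged file would no longer tell them apart, which is the "read group issue" to watch for.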

Then jointly call the potential variants on the merged BAM using FreeBayes; this will output a single VCF with the trio. You may well want to filter this VCF of potential variants afterwards.
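Since the stated goal is to count the different kinds of variants in the final VCF, here is a small sketch of one way to do that in plain Python. It classifies each ALT allele only by comparing REF and ALT lengths, which is a simplification (complex substitutions, for example, are not distinguished), and the three records below are made up for illustration.

```python
from collections import Counter


def classify(ref, alt):
    """Classify one REF/ALT pair by allele length only (a simplification)."""
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    return "insertion" if len(alt) > len(ref) else "deletion"


def count_variant_types(vcf_lines):
    """Count variant types over all ALT alleles in an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        for alt in alts:
            if alt in (".", "*"):
                continue  # no alternate allele, or spanning deletion
            counts[classify(ref, alt)] += 1
    return counts


# Made-up trio records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tfather\tmother\tchild",
    "chr1\t100\t.\tA\tG\t50\t.\t.\tGT\t0/1\t0/0\t0/1",
    "chr1\t200\t.\tAT\tA\t50\t.\t.\tGT\t0/0\t0/1\t0/1",
    "chr1\t300\t.\tC\tCTT,T\t50\t.\t.\tGT\t0/1\t0/2\t1/2",
]

counts = count_variant_types(example)
for kind, n in sorted(counts.items()):
    print(kind, n)
```

For a real file you could pass an open file handle instead of the list, since the function only needs an iterable of lines.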

While GATK is in general a problem on Galaxy, there is one stand-alone tool that has an interesting capacity for trios: Select Variants from VCF files (Galaxy Version 0.0.3), it is on [...]. If you input a PED file, you can "output mendelian violation sites only". It is also a good tool for filtering out SNPs and other classes of variants.
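To illustrate what a Mendelian violation is at a single site, here is a rough sketch. It is not how the Select Variants tool is implemented; it only checks whether the child's genotype can be formed by taking one allele from each parent, and it assumes diploid, autosomal genotypes written in the VCF GT style (for example `0/1`).

```python
from itertools import product


def parse_gt(gt):
    """Split a diploid GT string such as '0/1' or '1|0' into its two alleles."""
    return gt.replace("|", "/").split("/")


def is_mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can be built from one paternal
    and one maternal allele. Sites with missing calls are not judged."""
    f, m, c = parse_gt(father), parse_gt(mother), parse_gt(child)
    if "." in f + m + c:
        return True  # missing genotype: cannot call a violation
    possible = {tuple(sorted(pair)) for pair in product(f, m)}
    return tuple(sorted(c)) in possible


# Hypothetical genotypes: (father, mother, child)
sites = [
    ("0/1", "0/0", "0/1"),  # child's alt allele can come from the father
    ("0/0", "0/0", "0/1"),  # alt allele absent in both parents
    ("1/1", "1/1", "0/1"),  # ref allele absent in both parents
]

violations = [s for s in sites if not is_mendelian_consistent(*s)]
print(len(violations), "of", len(sites), "sites are Mendelian violations")
```

Apparent violations in real data are often genotyping errors rather than true de novo events, so they are worth inspecting before drawing conclusions.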


Just to add a bit: Definitely, assign read groups before merging BAM/SAM data. Tutorials can be found here with examples:

written 17 months ago by Jennifer Hillman Jackson25k