Question: Trying to understand the steps for 'identify DNA polymorphism sites'
15 months ago by
pilargomez91 wrote:

Greetings! My partners and I got this homework. We're not specialized in bioinformatics or anything (pretty much amateurs), and we're struggling with some tools. If there is a good soul who would help us, my faith in humanity will be... wait, I never lose faith in humanity, but we would be very thankful.

Our homework consists of the following:

Our data are paired-end reads from a mother, father and daughter (R1 and R2 for each). The task is to map the three sets against a proper reference genome and, with the right strategies, find sites with strong evidence of polymorphism. We should use the VCF to analyze SNPs, indels, MNPs and complex variants (multiple internal alternative alleles), and to find the 5 genes with the most polymorphic sites, including results only when the probability of a false positive is at most 1 in 10,000, based on the QUAL column of the VCF.

OK, we found this link with the steps. The workflow is on the link (

Steps 1, 2 and 3 are completed by now. Can someone please give a brief explanation of why each step in stage 1 is performed? (Stage 2 is pretty clear, though.) And how should we use our data?


Identify DNA polymorphic sites

*This page describes the analysis of DNA polymorphic sites in father-mother-child sequencing samples:

step 1: load data - the data are loaded from local files; set the "fastqsanger" format and the "hg19" database on the starting page

step 2: check quality of all sequencing files - use the FastQC tool (version: 0.63) to check the quality of the sequencing

step 3: mapping - use the BWA-MEM tool (version: 0.1) to map the reads to the reference genome (choose hg19 as reference), in paired-end mode

step 4: add or replace read groups - label each group (the mapping file) using AddOrReplaceReadGroups (version: 1.126.0)

step 5: merge the 3 individual mapping files - use MergeSamFiles (version: 1.126.0)

step 6: filter - use the filter tools: Filter (version: 1.126.0, removes low-quality mappings), MarkDuplicates (version: 1.126.0, filters out duplicated mappings), CleanSam (version: 1.126.0)

step 7: identify polymorphic sites - use the FreeBayes tool (version: 0.4) to identify polymorphic sites based on the hg19 genome

step 8: filter out false positive sites - using VCFfilter (version: 0.0.3) to select sites where the chance of a false positive call is 1 in 10,000 or better.

step 9: extract workflow and download final vcf file for further analyses.
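
For step 8, it may help to see where a numeric cutoff would come from. Assuming the QUAL column follows the usual VCF convention of a Phred-scaled score (Q = -10 * log10 of the probability that the call is wrong), a false-positive chance of 1 in 10,000 works out to a QUAL of about 40. The sketch below only does that arithmetic; the function names are made up for illustration, and the exact expression to type into VCFfilter is not given in the tutorial text quoted here. By analogy with the `-f "TYPE = snp"` example in step 11, something along the lines of `QUAL > 40` would be the natural guess, but that is an assumption to check against the tool's own help.

```python
import math


def phred_from_error_probability(p_error: float) -> float:
    """Convert an error probability to a Phred-scaled quality: Q = -10 * log10(p)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_probability_from_phred(qual: float) -> float:
    """Inverse conversion: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-qual / 10.0)


# The homework's requirement: at most 1 false positive in 10,000 calls.
threshold = phred_from_error_probability(1 / 10_000)
print(f"QUAL threshold: {threshold:.1f}")

# Sanity check in the other direction for a few QUAL values.
for qual in (20, 30, 40):
    print(qual, error_probability_from_phred(qual))
```

Records whose QUAL is at or above this threshold are the ones that meet the "1 in 10,000 or better" requirement.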

Stage 2 - analyze data of polymorphic sites based on vcf file

step 10: load data - set format as "vcf", genomic database as hg19

step 11: identify the number of snp, mnp, del, ins or complex sites - use the VCFfilter tool (version: 0.0.3) to select the different types of polymorphism (for example, -f "TYPE = snp" selects snp only), then use the Filter tool (version: 1.1.0) to find duplicated polymorphisms

step 12: identify genes with polymorphic sites - using ANNOVAR Annotate VCF tool (version: 0.1) to annotate the vcf file in step 10

step 13: count polymorphic sites for each gene - using Group tool (version: 2.1.0, by gene name) to count number of polymorphic sites for each gene

step 14: sort results - use the Sort tool (version: 1.0.3, descending order) to sort the results of step 13.*
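
To make stage 2 a bit more concrete, here is a rough Python sketch of what steps 11, 13 and 14 amount to outside Galaxy. It assumes the VCF carries a `TYPE` entry in the INFO column with one value per alternate allele (as the `TYPE = snp` filter in step 11 suggests) and that QUAL is Phred-scaled, so 40 corresponds to 1 in 10,000. The VCF records and the gene labels below are invented purely for illustration; in the real workflow the gene names would come from the annotation in step 12.

```python
from collections import Counter

# Hypothetical, made-up VCF content (tab-separated columns).
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t55.2\t.\tDP=30;TYPE=snp",
    "chr1\t200\t.\tAT\tA\t12.0\t.\tDP=8;TYPE=del",
    "chr2\t300\t.\tC\tCT,G\t80.1\t.\tDP=41;TYPE=ins,snp",
    "chr2\t400\t.\tAC\tGT\t47.9\t.\tDP=25;TYPE=mnp",
])


def count_types(vcf_text: str, min_qual: float = 40.0) -> Counter:
    """Count TYPE values over records whose QUAL exceeds min_qual."""
    counts = Counter()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        qual, info = fields[5], fields[7]
        if qual == "." or float(qual) <= min_qual:
            continue  # missing or too-low quality
        info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
        # TYPE may list one value per alternate allele, comma-separated.
        for variant_type in info_map.get("TYPE", "").split(","):
            if variant_type:
                counts[variant_type] += 1
    return counts


type_counts = count_types(EXAMPLE_VCF)
print(dict(type_counts))

# Steps 13-14: group by gene and sort descending (gene labels are hypothetical).
genes_per_site = ["TTN", "BRCA2", "TTN", "MUC16", "TTN", "MUC16"]
top_genes = Counter(genes_per_site).most_common(5)
print(top_genes)
```

The `Counter.most_common(5)` call plays the role of the Group and Sort tools together: it tallies sites per gene and returns the five largest counts in descending order.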

15 months ago by
Jennifer Hillman Jackson wrote:


The first steps involve data preparation for the downstream analysis. Each is described in the tutorial steps.

This set of learning tutorials might help, as they cover the same steps in different analyses, and the overview can provide context for how to use the results (through the linked publications, technology summaries, etc.):

Also know that many tools are wrapped third-party applications. Reviewing the manuals for these will almost certainly help with the "why use them" questions.

If any steps remain unclear after reviewing the above, please let us know which and we can try to help more.

Thanks, Jen, Galaxy team

