Question: How to calculate percentage of indels in vcf file generated after mpileup on bam files?
2.3 years ago by
We used genome editing nucleases to alter a specific site in the mouse genome, which led to indel formation. I then performed PCR at the site of interest, followed by Illumina deep sequencing of the PCR products. My aim is to calculate the frequency of indels in the BAM files generated from this deep sequencing.

To identify the different indels, I performed indel calling using mpileup. As a result, a VCF file was generated containing the list of SNPs and indels. I have the following two questions:

1. How to calculate the frequency and percentage of indels?: To get the frequency of indels, I used the value of 'IMF' provided in the INFO column of the VCF file. In the VCF file, IMF is defined as "Maximum fraction of reads supporting an indel". I multiplied IMF by 100 to get the maximum percentage of reads supporting an indel. Is my assumption correct that IMF indicates the fraction of indels in the PCR products used for deep sequencing? If so, multiplying IMF by 100 should give the percentage of indels in my PCR products.

2. How to calculate the total percentage of indels in the PCR products?: My PCR products contain several different types of indels at the same site, due to the use of genome editing nucleases. If I sum the IMF values of all the indels and then multiply the sum by 100, do I get the total percentage of indels in my PCR product? Is this assumption correct?
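
To make the calculation concrete, this is roughly what I am doing, written as a minimal Python sketch. It assumes that indel records carry an `INDEL` flag and an `IMF=` tag in the INFO column; the function names are hypothetical and the example records are made up for illustration only.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'INDEL;IMF=0.4;DP=100' into a dict."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Flags such as INDEL carry no value, so store True for them.
        fields[key] = value if sep else True
    return fields


def indel_imf_percentages(vcf_lines):
    """Return (pos, ref, alt, IMF * 100) for each indel record with an IMF tag."""
    rows = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        cols = line.rstrip("\n").split("\t")
        info = parse_info(cols[7])  # INFO is the 8th VCF column
        if "INDEL" in info and "IMF" in info:
            pct = float(info["IMF"]) * 100.0
            rows.append((int(cols[1]), cols[3], cols[4], pct))
    return rows


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tACT\tA\t50\t.\tINDEL;IMF=0.25;DP=200",
    "chr1\t100\t.\tA\tAG\t40\t.\tINDEL;IMF=0.10;DP=200",
    "chr1\t105\t.\tC\tT\t30\t.\tDP=200",
]

rows = indel_imf_percentages(example)
for pos, ref, alt, pct in rows:
    print(pos, ref, alt, round(pct, 1))

# Question 2: the naive sum of the per-indel percentages.
total_pct = sum(pct for _, _, _, pct in rows)
print("sum of IMF percentages:", round(total_pct, 1))
```

Whether the per-record values and their sum can be read as indel percentages in the PCR products is exactly what I am asking.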

Thanks for help.

2.3 years ago by
This is a complicated question since much depends on how the data was prepared before indel calling. This is not really a Galaxy question, so the advice here is general. In short, be careful about interpreting these statistics based on reads as being descriptive of the original PCR products.

Why? The fraction values are to be interpreted as the number of indels detected relative to the number of reads input to the tool. Depending on how the sequencing was done, and on whether and how PCR-duplicate removal and alignment quality filtering were done, more than one read could be derived from any single initial PCR product, any particular read could contribute to one or more indels, and the same indel could be detected more than once due to sequencing errors or misalignment.
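
To illustrate the point that one read can contribute to more than one indel, here is a rough Python sketch of a read-level count: the fraction of aligned reads with at least one insertion or deletion touching a target window, so each read is counted once. This is not what mpileup computes. It assumes plain-text SAM records, that every read is an amplicon read spanning the site, and a hypothetical window of positions 98 to 104; the function names and example reads are made up.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def read_has_indel(pos, cigar, win_start, win_end):
    """True if an I or D operation touches the 1-based window [win_start, win_end]."""
    ref = pos  # reference coordinate of the next aligned base
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "D":
            # Deletion covers reference positions ref .. ref + n - 1.
            if ref <= win_end and ref + n - 1 >= win_start:
                return True
            ref += n
        elif op == "I":
            # Insertion sits just before reference position ref.
            if win_start <= ref <= win_end:
                return True
        elif op in "MN=X":
            ref += n
        # S, H and P do not consume the reference.
    return False


def fraction_reads_with_indel(sam_lines, win_start, win_end):
    """Fraction of primary mapped reads with at least one indel in the window."""
    total = with_indel = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        cols = line.rstrip("\n").split("\t")
        flag, pos, cigar = int(cols[1]), int(cols[3]), cols[5]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & (0x4 | 0x100 | 0x800) or cigar == "*":
            continue
        total += 1
        if read_has_indel(pos, cigar, win_start, win_end):
            with_indel += 1
    return with_indel / total if total else 0.0


# Made-up reads: r3 carries two indels but is still counted once.
reads = [
    "r1\t0\tchr1\t90\t60\t20M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t90\t60\t10M2D8M\t*\t0\t0\t*\t*",
    "r3\t0\tchr1\t90\t60\t10M1I4M3D5M\t*\t0\t0\t*\t*",
    "r4\t0\tchr1\t90\t60\t20M\t*\t0\t0\t*\t*",
]
frac = fraction_reads_with_indel(reads, 98, 104)
print("percent of reads with an indel:", frac * 100)
```

A count like this still describes reads rather than the original PCR products, so the caveats above about duplicates, filtering and misalignment apply to it as well.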

VarScan is also available, as are GATK tools. All of these tools have usage documentation online, and most (plus other indel calling tools) have been compared in multiple publications, including some that specifically target indel detection in deep sequencing experiments. A review and comparison of results should help you arrive at the best method and appropriate descriptive statistics.

Best, Jen, Galaxy team


2.3 years ago by
Hi Jen,

Thanks for the answer. I have one more question. Suppose all the steps (PCR, PCR-duplicate removal, alignment quality filtering, sequencing errors and misalignments) have been taken into account perfectly or almost perfectly. Can we then consider IMF values equivalent to the frequency of indels, since the Galaxy team has defined IMF as "Maximum fraction of reads supporting an indel"?



