Depth of Coverage

Question: Depth of Coverage

4.2 years ago by

European Union

I am noticing that the DP at sites that eventually become variant calls is greatly reduced after certain steps of my workflow (for example, between the SAM-to-BAM file and the MarkDups file). This is also evident from the file size, which drops from 1.5 GB to 195 MB. I was wondering whether Galaxy automatically reduces the number of reads, and whether there is a way to recover the real DP after the variant calling process. Thanks a lot

snp • 1.3k views

modified 4.2 years ago by Jennifer Hillman Jackson ♦ 25k • written 4.2 years ago by mariano.avino • 0

4.2 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hi,

SAM->BAM will reduce file size because of compression, so you can't judge read loss from file size alone. Skipping the mark dups step will only skew analysis results. I'm not sure which exact tools/workflow you are using, but you can always run a more direct variant caller and compare it to the current workflow (for example 'Naive Variant Caller', then 'Variant Annotator' to filter).
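To illustrate the compression point, here is a minimal Python sketch. It is not what Galaxy runs: it uses plain gzip from the standard library as a stand-in for BAM's block-compressed format, and the SAM-like records are synthetic (the helper names `make_sam_like_lines` and `compare_sizes` are made up for this example). The point is only that the byte count shrinks while the number of records stays the same.

```python
import gzip
import random


def make_sam_like_lines(n_reads, seed=0):
    """Build synthetic, tab-separated SAM-like records (illustration only)."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(50))
        qual = "I" * 50
        lines.append(
            f"read{i}\t0\tchr1\t{1000 + i}\t60\t50M\t*\t0\t0\t{seq}\t{qual}"
        )
    return lines


def compare_sizes(lines):
    """Return (raw byte count, compressed byte count, records after round trip)."""
    raw = ("\n".join(lines) + "\n").encode("ascii")
    packed = gzip.compress(raw)
    restored = gzip.decompress(packed).decode("ascii").splitlines()
    return len(raw), len(packed), len(restored)


raw_size, packed_size, n_restored = compare_sizes(make_sam_like_lines(2000))
print(f"raw bytes: {raw_size}")
print(f"compressed bytes: {packed_size}")
print(f"records after round trip: {n_restored}")
```

The compressed size is smaller, yet every record survives the round trip, which is why read counts rather than file sizes are the thing to compare.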

An example that compares variant tools is in this tutorial:
http://usegalaxy.org/u/galaxyproject/p/galaxy-101-ngs-variant

To gather stats (actual counts of the reads contained), see the SAMtools and Picard tool groups. You can also try a tool like 'Mpileup' to do a direct count-based variant call, but this is for DNA samples. Or run tools like 'Depth of Coverage' or 'Create a BedGraph of genome coverage' to find the coverage of a particular region (all positions, not just variant locations). These tools are better than examining file size.
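As a rough illustration of what a count-based depth looks like, and of how reads marked as duplicates can change it, here is a small Python sketch that works on SAM text records. It is a simplification, not a replacement for the tools named above: the function name `depth_per_position` and the three example reads are invented for this example, deletions and skipped regions advance the position without adding depth, and whether a given variant caller ignores duplicate-flagged reads depends on that tool's settings.

```python
import re
from collections import Counter

DUPLICATE_FLAG = 0x400  # SAM flag bit: PCR or optical duplicate
UNMAPPED_FLAG = 0x4     # SAM flag bit: read is unmapped
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def depth_per_position(sam_lines, skip_duplicates=False):
    """Count aligned bases per (chrom, 1-based position) from SAM text lines."""
    depth = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & UNMAPPED_FLAG or cigar == "*":
            continue
        if skip_duplicates and flag & DUPLICATE_FLAG:
            continue
        ref_pos = pos
        for length, op in CIGAR_RE.findall(cigar):
            length = int(length)
            if op in "M=X":  # aligned bases: count them and advance
                for p in range(ref_pos, ref_pos + length):
                    depth[(chrom, p)] += 1
                ref_pos += length
            elif op in "DN":  # deletion or skip: advance without counting
                ref_pos += length
            # I, S, H and P do not consume reference positions
    return depth


# Hypothetical reads: r2 carries the duplicate flag (1024 = 0x400).
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t1024\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r3\t0\tchr1\t102\t60\t2M1D2M\t*\t0\t0\tGTCC\tIIII",
]

all_depth = depth_per_position(example)
dedup_depth = depth_per_position(example, skip_duplicates=True)
for p in range(100, 107):
    print(p, all_depth[("chr1", p)], dedup_depth[("chr1", p)])
```

Comparing the two columns shows where duplicate-flagged reads account for the difference in depth, without any read having been deleted from the file.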

Also compare the original data with the mapped data. If the number of concordant alignment pairs is low after mapping, then that is the root of the issue. This could be scientifically valid, or it could indicate a processing issue (meaning that tools or parameters should be tuned). Start by checking back through earlier steps to determine where the data loss was introduced. Then try to confirm whether it is an actual property of the sample or an issue with how the data was prepped prior to mapping or during mapping (was QC too much or too little; were the quality scores scaled correctly to .fastqsanger; was the best mapping tool used, with parameters that fit the data). Maximizing concordant alignments is key in most NGS analyses (if the input is paired).
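One way to put a number on concordance is to look at the SAM FLAG field. The Python sketch below assumes the mapper sets the "properly paired" bit (0x2) for pairs it considers concordant, which is the usual convention but worth confirming for your mapping tool. The function name `proper_pair_fraction` and the list of flag values are made up for illustration.

```python
PAIRED = 0x1           # read is part of a pair
PROPER_PAIR = 0x2      # aligner considers the pair properly aligned
SECONDARY = 0x100      # secondary alignment
SUPPLEMENTARY = 0x800  # supplementary alignment


def proper_pair_fraction(flags):
    """Fraction of paired, primary records that carry the proper-pair bit."""
    paired = 0
    proper = 0
    for flag in flags:
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        if not flag & PAIRED:
            continue  # single-end reads are not part of this ratio
        paired += 1
        if flag & PROPER_PAIR:
            proper += 1
    return proper / paired if paired else 0.0


# Hypothetical FLAG values as they might appear in column 2 of a SAM file.
example_flags = [99, 147, 83, 163, 65, 129, 73, 133, 0, 2147]
print(f"properly paired fraction: {proper_pair_fraction(example_flags):.2f}")
```

Running the same calculation after each step of a workflow is one way to locate where pairs stop being concordant.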

Hopefully this helps, Jen, Galaxy team

