Converting SAM->BAM will reduce file size because BAM is compressed, so you can't directly compare results based on file size. Skipping the mark duplicates step will only skew the analysis results. I'm not sure exactly which tools/workflow you are using, but you can always run a more direct variant caller and compare its output to the current workflow (for example, 'Naive Variant Caller' followed by 'Variant Annotator' to filter).
An example that compares variant tools is in this tutorial:
To gather stats (actual counts of the reads contained in each file), see the SAMtools and Picard tool groups. You can also try a tool like 'Mpileup' to do a direct count-based variant call, but this is intended for DNA samples. Or run tools like 'Depth of Coverage' or 'Create a BedGraph of genome coverage' to find the coverage across a particular region (all positions, not just variant locations). Any of these is more informative than examining file size.
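As a rough illustration of what "actual counts of reads" means, here is a minimal Python sketch that tallies records from SAM text using the FLAG column. It is not how the SAMtools or Picard tools are implemented; the function name `count_reads` and the toy records are made up for this example, and only the FLAG bit values come from the SAM format.

```python
from collections import Counter

# FLAG bits as defined in the SAM format
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def count_reads(sam_lines):
    """Tally alignment records from an iterable of SAM text lines.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@')
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second column
        counts["records"] += 1
        # Secondary/supplementary records are extra alignments of a read
        # that has already been counted, so leave them out of the read totals
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        counts["primary"] += 1
        if not flag & UNMAPPED:
            counts["mapped"] += 1
        if flag & DUPLICATE:
            counts["duplicates"] += 1
    return counts


# Toy records (made up): one proper pair, one duplicate-marked read,
# one unmapped read, and one secondary alignment
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r1\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*",
    "r2\t1123\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r3\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r4\t256\tchr2\t500\t0\t50M\t*\t0\t0\t*\t*",
]
print(dict(count_reads(example)))
```

Counts like these stay the same whether the data is stored as SAM or BAM, which is why they are comparable between steps when file sizes are not.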
Also check the original data against the mapped data. If the number of concordant alignment pairs is low after mapping, then that is the root of the issue. This could be scientifically valid, or it could indicate an issue in processing (meaning that tools or parameters should be tuned). Start by checking back through the earlier steps to determine where the data loss was introduced. Then try to confirm whether it is an actual property of the sample or an issue with how the data was prepped prior to or during mapping (was there too much or too little QC; were the quality scores scaled correctly to .fastqsanger; was the best mapping tool used, with parameters that fit the data). Maximizing concordant alignments is key in most NGS analyses (if the input is paired).
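To make the concordance check concrete, below is a small Python sketch that computes the fraction of paired reads flagged as "properly paired" in SAM text. This assumes the mapper sets the proper-pair FLAG bit (0x2) for concordant pairs, which is up to each aligner; the function name `proper_pair_fraction` and the toy records are made up for this example. A tool such as samtools flagstat reports a comparable "properly paired" figure.

```python
# FLAG bits as defined in the SAM format
PAIRED = 0x1
PROPER_PAIR = 0x2
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def proper_pair_fraction(sam_lines):
    """Return the fraction of paired primary reads flagged as properly paired.

    Hypothetical helper for illustration only.
    """
    paired = 0
    proper = 0
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@')
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second column
        # Ignore secondary/supplementary records and single-end reads
        if flag & (SECONDARY | SUPPLEMENTARY) or not flag & PAIRED:
            continue
        paired += 1
        if flag & PROPER_PAIR:
            proper += 1
    return proper / paired if paired else 0.0


# Toy records (made up): one properly paired pair and one unmapped pair
pairs_example = [
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r1\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(proper_pair_fraction(pairs_example))
```

Running the same calculation on the output of each step (after mapping, after filtering, after duplicate marking) is one way to see where the pairs are being lost.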
Hopefully this helps, Jen, Galaxy team