Hello. When I analyze my data in FASTQC, the SEQUENCE DUPLICATION LEVELS are OK (green label, and the percent of seqs remaining if deduplicated is 75%). Then I use MAP WITH BWA to align the reads and I get the BAM file. But when I run MARK DUPLICATES on the BAM file, the PERCENT_DUPLICATION is 0.82, and I don't know why. Can somebody help me, please? Thank you.
The tools count duplicated reads based on different criteria. Also, FastQC only analyzes the first 100-200k reads (depending on the module), whereas the Picard MarkDuplicates step only considers mapped reads.
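To make "different criteria" concrete, here is a minimal sketch of two ways duplicates can be counted: by exact read sequence (roughly the idea behind a sequence-level check) and by mapping position and strand (roughly the idea behind a position-based check on aligned reads). The reads, coordinates, and the helper name `duplicate_fraction` are all made up for illustration; this is a toy model, not the actual algorithm of FastQC or Picard.

```python
from collections import Counter

# Toy reads: (sequence, chromosome, 5' start position, strand).
# All values are invented for illustration.
reads = [
    ("ACGTACGTAC", "chr1", 100, "+"),
    ("ACGTACGTAC", "chr1", 100, "+"),  # exact copy of the first read
    ("ACGTACGTAT", "chr1", 100, "+"),  # same position, last base differs
    ("ACGTACG",    "chr1", 100, "+"),  # same position, shorter read
    ("GGGTTTCCCA", "chr2", 500, "-"),
]

def duplicate_fraction(keys):
    """Fraction of items that are extra copies of an already-seen key."""
    keys = list(keys)
    counts = Counter(keys)
    extra = sum(n - 1 for n in counts.values())
    return extra / len(keys)

# Criterion 1: two reads are duplicates only if their sequences are identical.
by_sequence = duplicate_fraction(seq for seq, _, _, _ in reads)

# Criterion 2: two reads are duplicates if they map to the same place and strand.
by_position = duplicate_fraction(
    (chrom, pos, strand) for _, chrom, pos, strand in reads
)

print(f"duplicates by sequence: {by_sequence:.0%}")
print(f"duplicates by position: {by_position:.0%}")
```

In this toy set the same five reads give a much higher duplicate fraction under the position-based criterion, because reads that differ by one base or by length still share a mapping position.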
It may help to review the tools to better understand how they function. Links to the manuals are on the tool forms. There is also much discussion about QA methods and duplicates, plus how to handle them, in the Galaxy Tutorials https://galaxyproject.org/learn/ and at general bioinformatics Q&A sites like https://www.biostars.org/ and https://bioinformatics.stackexchange.com/. Some tools also host Google Groups or other tool-specific discussion sites.
For example, FastQC is described in detail here: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ >> https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html
Thanks! Jen, Galaxy team
I have a FASTQ file with 30,000 reads (fewer than 100,000). When I analyze them with FastQC, the level of duplicates is very high. Then I trim the reads to remove low-quality bases and the duplicate levels drop to normal. I suppose this is because once the read lengths change, the program no longer considers them duplicates. Then I map the reads with Bowtie2 and pass the BAM file to MarkDuplicates. Almost all the sequences are mapped, but PERCENT_DUPLICATION goes back up to 80%. How can this be? Thank you.
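The trimming hypothesis above can be illustrated with a small sketch. It assumes that quality trimming removes bases from the 3' end only, so the 5' start of each read does not move, and it compares reads by exact sequence versus by start position and strand. The fragment, the coordinate, the trim lengths, and the helper `distinct_fraction` are hypothetical, and the comparison is a simplification of what either tool actually does.

```python
# Hypothetical example: five copies of one fragment, all starting at position 1000.
fragment = "ACGTTGCAAGGCTTAACCGG"
start = 1000
raw_reads = [fragment] * 5

# Assumed 3' trim lengths, as if quality dropped at a different point in each copy.
trimmed_lengths = [20, 18, 17, 15, 12]
trimmed_reads = [fragment[:n] for n in trimmed_lengths]

def distinct_fraction(items):
    """Fraction of items that would remain if exact copies were collapsed."""
    items = list(items)
    return len(set(items)) / len(items)

# Before trimming, all five sequences are identical.
raw_distinct = distinct_fraction(raw_reads)

# After trimming, every sequence has a different length, so none match exactly.
trimmed_distinct = distinct_fraction(trimmed_reads)

# The 5' start and strand are unchanged by 3' trimming, so a position-based
# comparison still sees five reads at the same place.
position_distinct = distinct_fraction((start, "+") for _ in trimmed_reads)

print(raw_distinct, trimmed_distinct, position_distinct)
```

Under these assumptions the trimmed reads look unique when compared by sequence, yet they still pile up on one position after mapping, which would be consistent with a low duplicate level before mapping and a high one afterwards.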