Hello. When I analyze my data in FastQC, the Sequence Duplication Levels module is OK (green label, and the percentage of sequences remaining if deduplicated is 75%). Then I use Map with BWA to align the reads and get a BAM file. But when I run MarkDuplicates on the BAM file, PERCENT_DUPLICATION is 0.82, and I don't know why. Can somebody help me, please? Thank you.
Hello,
The tools count duplicated reads based on different criteria. Also, FastQC only analyzes the first 100-200k reads (depending on the module), whereas the Picard MarkDuplicates step only considers mapped reads.
It may help to review the tools to better understand how they function. Links to the manuals are on the tool forms. There is also much discussion about QA methods and duplicates, and how to handle them, in the Galaxy Tutorials https://galaxyproject.org/learn/ and at general bioinformatics Q&A sites like https://www.biostars.org/ and https://bioinformatics.stackexchange.com/. Some tools also host Google Groups or other tool-specific discussion sites.
For example, FastQC is described in detail here: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ >> https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html
Thanks! Jen, Galaxy team
I have a FASTQ file with 30,000 reads (fewer than 100,000). When I analyze them with FastQC, the level of duplicates is very high. Then I trim the reads to remove low-quality bases, and the duplicate levels drop to normal. I suppose this is because, once the read lengths change, the program no longer considers them duplicates. Then I map the reads with Bowtie2 and pass the BAM file to MarkDuplicates. Almost all the sequences are mapped, but PERCENT_DUPLICATION goes back up to 80%. How can this be? Thank you.
Is there some part of the original reply that is not clear? In short, "duplicate" detection does not represent the same data evaluation method between the two tools, does not consider the same sequence data at a technical level, and is not expected to produce the same percentages (so the results cannot be directly compared).
FastQC is a QA tool that runs on subsets of data. The duplicates in the final, complete mapped BAM are the duplicates that you will probably want to mark or remove during a variant analysis workflow. From what you have shared, the original library construction seems to be problematic and produced a high rate of duplication. Up to 50% might be OK, but higher is not great. Several factors can contribute to this type of sequencing result. Reasons are covered in the tutorials, external reference links, and in discussions at general bioinformatics forums such as https://www.biostars.org/ and https://bioinformatics.stackexchange.com/.
You may want to contact the lab that created the data and share your results, especially if this was done by a vendor. Public datasets might also be problematic, but there is not much that can be done about those. The data authors/source are probably already aware that the data are not of high diversity/content quality.
I know that FastQC only analyzes the first 100,000 reads and that MarkDuplicates only considers mapped reads. But I only have 30,000 reads, so all of them are taken into account. Also, all of the reads are mapped. Why, in this case, do the two tools not consider the same data? I think the data are practically the same, and I don't understand why one program finds hardly any duplicates while the other reports 80%.
FastQC looks for sequencing duplicates. Reads that are exact matches, meaning the same nucleotide string, are considered sequencing duplicates. These can also sometimes be counted in the FastQC "Overrepresented sequences" module, but not always.
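As a rough illustration of exact-string duplicate counting, here is a minimal Python sketch. It is not FastQC's actual implementation (it ignores the subsampling mentioned earlier in this thread, among other details), and the function name and example reads are made up for the illustration.

```python
from collections import Counter


def percent_remaining_after_dedup(sequences):
    """Percent of reads left if each distinct sequence is kept only once.

    Hypothetical helper: two reads count as duplicates only when their
    nucleotide strings are identical, including their length.
    """
    if not sequences:
        return 100.0
    distinct = len(Counter(sequences))
    return 100.0 * distinct / len(sequences)


# Made-up reads: the third is the first one with its last base trimmed off,
# so an exact-string comparison treats it as a different sequence.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACG", "TTGCAAGC"]
print(percent_remaining_after_dedup(reads))  # 75.0
```

Because the comparison is on the whole string, trimming reads to different lengths can make them look unique to this kind of check.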
MarkDuplicates examines the portion of the sequence that actually mapped to the target genome and the mapping position/CIGAR. If these match exactly, the reads are flagged as duplicates; optical duplicates are a subset of these that the tool reports separately.
Trimming the sequences to remove artifacts is not enough to clear up this kind of duplication. Trimming artifacts can help more reads to map, although most modern NGS mappers will simply ignore short portions of read ends that are not represented in the target genome (during mapping, advanced settings can be tuned to make this more or less stringent).
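To make position-based duplicate detection more concrete, here is a simplified Python sketch for single-end reads. It is only an approximation of the idea, not Picard's code: it groups reads by chromosome, strand, and the 5' coordinate after adding back clipped bases from the CIGAR. The function names and the example alignments are hypothetical.

```python
import re
from collections import defaultdict

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def five_prime_key(chrom, pos, reverse, cigar):
    """Return (chrom, strand, unclipped 5' coordinate) for one alignment.

    pos is the 1-based leftmost aligned position (the SAM POS field).
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if not reverse:
        # Forward strand: the 5' end is on the left, so subtract leading clips.
        lead = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead += n
        return (chrom, "+", pos - lead)
    # Reverse strand: the 5' end is the rightmost base, so add the reference
    # span of the alignment plus any trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return (chrom, "-", pos + ref_len - 1 + trail)


def count_position_duplicates(alignments):
    """Count reads beyond the first in each group sharing a 5' key."""
    groups = defaultdict(int)
    for aln in alignments:
        groups[five_prime_key(*aln)] += 1
    return sum(n - 1 for n in groups.values())


# Made-up alignments: (chrom, POS, is_reverse, CIGAR)
alignments = [
    ("chr1", 100, False, "50M"),
    ("chr1", 100, False, "42M"),    # same start, shorter after 3' trimming
    ("chr1", 100, False, "36M"),
    ("chr1", 103, False, "3S47M"),  # soft-clipped; unclipped start is 100
    ("chr1", 500, False, "50M"),
]
print(count_position_duplicates(alignments))  # 3
```

In this sketch, forward-strand reads of different lengths that start at the same coordinate fall into one group, even though their sequences are not identical strings. The real tool also accounts for read pairs, libraries, and base qualities when choosing which read to keep.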
There are no overrepresented sequences in the FastQC results, and there are 0 optical duplicates in the MarkDuplicates output.
Is it possible that MarkDuplicates only takes into account the 5' coordinate of the reads to decide whether they are duplicates? If so, reads that FastQC did not consider duplicates, because their 3' ends were trimmed to different lengths, would be flagged as duplicates by MarkDuplicates for having the same 5' coordinates.