I am performing a quality check on a transcriptomic dataset before attempting a de novo assembly. Duplication is high, as expected, but I did not expect this result in the k-mer graph:
No sequence is flagged as overrepresented in FastQC, yet the k-mer graph makes me think that reads starting with "CCGACTTTGGACGAG" are overrepresented. Trimming, while giving very good results in other respects, does not solve this problem. These are the results after trimming (sliding window; 15 bp headcrop; minimum length applied):
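One way to test this suspicion directly, independently of FastQC, would be to count how many reads begin with each 15-mer. The following is only a minimal sketch: the function names `read_sequences` and `prefix_counts` are made up for illustration, and it assumes standard four-line FASTQ records (no wrapped sequence lines), optionally gzipped.

```python
import gzip
from collections import Counter


def read_sequences(path):
    """Yield the sequence line of each record in a 4-line FASTQ file.

    Assumption: records are exactly four lines each; files ending in
    '.gz' are opened with gzip, anything else as plain text.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every record is the sequence
                yield line.strip()


def prefix_counts(sequences, k=15):
    """Count the first k bases of every read that is at least k long.

    Returns (Counter of prefixes, total number of reads seen).
    """
    counts = Counter()
    total = 0
    for seq in sequences:
        total += 1
        if len(seq) >= k:
            counts[seq[:k]] += 1
    return counts, total
```

With a hypothetical file name such as `reads_R1.fastq.gz`, calling `counts, total = prefix_counts(read_sequences("reads_R1.fastq.gz"))` and then inspecting `counts.most_common(10)` would show whether "CCGACTTTGGACGAG" really dominates the read starts, and `counts["CCGACTTTGGACGAG"] / total` would give the fraction of reads affected.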
What do you think may be causing this? How would you proceed?
Thank you in advance.