Hi all, I'm currently analysing some RNA-seq samples which were amplified prior to library prep and adapter ligation. Having filtered out the adapter sequences, I'm left with a very specific k-mer contamination: CTTCAG starting at position 15 of the reads, with all bases on either side equally represented. I had a look at the reads containing this k-mer at that position: no individual sequence is excessively abundant, and the most frequent ones all map to common genes.
It seems to come from one of the amplification primers, despite the primer removal reaction. The primer is 3’-GACTTCNNNNNNNNNNNNNN (http://www.sigmaaldrich.com/technical-documents/protocols/biology/seqr.html); read in the opposite (5’-to-3’) direction, its fixed bases spell CTTCAG.
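In case it helps anyone reproduce the check, here is a minimal Python sketch of what I mean. It reverses the primer so it reads 5’-to-3’, reports where the fixed bases fall, and counts what fraction of a set of reads carry CTTCAG at position 15. The function names and the toy reads are made up for illustration, and it assumes reads are plain strings in the same orientation as the primer; it is not tied to any particular tool.

```python
MOTIF = "CTTCAG"   # k-mer reported as over-represented
MOTIF_START = 15   # 1-based start position reported for the k-mer


def primer_5_to_3(primer_3_to_5: str) -> str:
    """Reverse a primer written 3'->5' so it reads 5'->3' (no complementing)."""
    return primer_3_to_5[::-1]


def has_motif_at(read: str, motif: str = MOTIF, start: int = MOTIF_START) -> bool:
    """True if `motif` occurs in `read` beginning exactly at 1-based `start`."""
    i = start - 1
    return read[i:i + len(motif)] == motif


def fraction_with_motif(reads, motif: str = MOTIF, start: int = MOTIF_START) -> float:
    """Fraction of reads carrying `motif` at `start` (0.0 for an empty input)."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(has_motif_at(r, motif, start) for r in reads) / len(reads)


# Primer exactly as written in the post (3'->5'), reversed to read 5'->3'.
primer = primer_5_to_3("GACTTCNNNNNNNNNNNNNN")
print("primer 5'->3':", primer)
print("1-based start of fixed bases:", primer.find(MOTIF) + 1)

# Toy reads, purely illustrative (not real data).
toy_reads = [
    "ACGTACGTACGTAC" + MOTIF + "TTGACC",  # motif at position 15
    "ACGTACGTACGTACGTACGTACGTAC",        # no motif at position 15
]
print("fraction with motif at position 15:", fraction_with_motif(toy_reads))
```

Running the same count per sample would show whether the fraction really is consistent across samples.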
Could anybody help me to understand how this is affecting my data and how I can control for it? Has this primer attached itself to the end of my sequences or has it caused an amplification bias towards genes containing that motif? Is there anything I can do to remove this contamination without losing a substantial portion of my data, or can I discount it because it seems to be consistent across samples?
Any insight would be much appreciated as I don't have much background in molecular biology.