I am analysing RNA-seq datasets for differential splicing events
between cell types. My reads are 36 bp long. To improve read quality,
I need to trim some nucleotides from the ends. How many nucleotides
can I trim? I am afraid that if I trim too much, the reliability of
the alignment will be affected.
Thanks in advance.
This general protocol is also in the RNA-seq tutorial:
--> Understanding and QCing the reads
That said, I had a sample of your data from before, ran FastQC on it,
and see what you mean: the quality drops off steadily after the first
10 bases or so, then falls below phred+20 around the middle of the
sequence (for both ends).
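
If you want to check the same thing outside of FastQC, here is a rough Python sketch of the calculation: the mean quality at each read position, and the first position where that mean falls below a cutoff. The function names, the cutoff of 20, and the quality offset of 33 are my assumptions for illustration (older Illumina data may use an offset of 64), so adjust them to your files.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position.

    quality_strings: iterable of FASTQ quality lines (one per read).
    offset: ASCII offset of the encoding (assumed 33 here; may be 64).
    """
    totals = []
    counts = []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                # First time we see a read this long: grow the accumulators.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def first_position_below(means, threshold=20):
    """Return the 0-based index of the first position whose mean quality
    is below threshold, or None if no position is."""
    for i, m in enumerate(means):
        if m < threshold:
            return i
    return None


if __name__ == "__main__":
    # Toy quality strings, not real data.
    means = mean_quality_per_position(["II5", "I5+"])
    print(means, first_position_below(means, threshold=20))
```

This only summarises the mean; FastQC's per-base plot also shows the spread, which matters when deciding how hard to trim.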
There are a few options -
1 - Do as Ann suggests and just leave these alone and test to see what
happens in TopHat. If the mapping fails, then you will know that you
need to do some quality cleanup.
2 - Use the FastQC results to decide on a lower quality score boundary
and trim the very worst sequences. Because of the length, yes, take
care not to remove too much. As I stated, from the sample I looked at,
even phred+20 would probably clip too aggressively.
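
To make option 2 concrete, here is a minimal Python sketch of conservative 3' trimming: it removes trailing bases below a quality cutoff but never shortens a read past a minimum length. The function name, the cutoff of 15, the minimum length of 25, and the offset of 33 are all placeholder assumptions of mine, not recommended values; pick them from your own FastQC results and test the effect on mapping.

```python
def trim_3prime(seq, qual, threshold=15, min_length=25, offset=33):
    """Trim low-quality bases from the 3' end of one read.

    seq, qual: sequence and quality strings of equal length.
    threshold: trailing bases with Phred score below this are removed.
    min_length: trimming stops once the read is this short, so a short
                read (e.g. 36 bp) is not cut down too far to align.
    offset: ASCII offset of the quality encoding (assumed 33; may be 64).
    """
    end = len(seq)
    while end > min_length and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # Toy read: five good bases followed by three poor ones.
    print(trim_3prime("ACGTACGT", "IIIII###", threshold=15, min_length=4))
    # Same read, but the length floor stops trimming after one base.
    print(trim_3prime("ACGTACGT", "IIIII###", threshold=15, min_length=7))
```

The length floor is the point of the sketch: with 36 bp reads, it caps how much any single read can lose regardless of how low the quality cutoff pushes.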
In general it is best to do as little manipulation as possible with
expression data. Some testing on your part will be needed to identify
the correct processing, and the same process will not apply to all
datasets. But the general path outlined in the tutorial is a good one
for what you are trying to do and should be able to address your