Hello Galaxy Team,
Sorry if this question is all over the place.
I have been trying to align some RNA-seq datasets (each around 6 million reads) against a diploid plant genome (32,928 sequences), but I cannot get them to align. RNA STAR and TopHat2 each fail with a different error. STAR reports that the job was terminated because it used more memory than it was allocated, while TopHat2 appears to complete an alignment and then fails with "AttributeError: 'module' object has no attribute 'ICM'". I'm not sure whether the two errors are related.
The only tool that was able to generate an alignment was HISAT2; however, I'd rather not use it because of its performance in benchmarking studies.
I suspect the alignment errors may be caused by a number of overrepresented sequences, which I think come from chloroplast DNA contamination, e.g.:
Sequence: GGCTTACGGTGGATACCTAGGCACCCAGAGACGAGGAAGGGCGTAGTAAG
Count: 60345
Percentage: 0.9102121596018445
Possible source: No Hit
I have around 8 such overrepresented sequences per sample. Could this be the reason for the errors? If so, how do I get rid of them? I know Trim Galore! has an option for trimming a custom sequence, but how do I remove more than one? Would I have to chain several Trim Galore! steps together in a workflow (one per overrepresented sequence), or is there a more suitable tool for this?
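To make clearer what I mean by getting rid of several sequences in one pass, here is a rough Python sketch of the kind of filtering I have in mind. It is not an existing Galaxy tool: the function names (`read_fastq`, `contains_any`, `filter_fastq`) are made up for illustration, it assumes plain 4-line FASTQ records, and it drops whole reads that contain any listed sequence instead of trimming them.

```python
import io
from typing import Iterator, List, TextIO, Tuple

# One FASTQ record: header, sequence, separator line, quality string.
Record = Tuple[str, str, str, str]


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield 4-line FASTQ records; stops at end of file or at a blank header line."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def contains_any(seq: str, unwanted: List[str]) -> bool:
    """True if the read contains at least one unwanted sequence (exact match only)."""
    return any(u in seq for u in unwanted)


def filter_fastq(in_handle: TextIO, out_handle: TextIO, unwanted: List[str]) -> Tuple[int, int]:
    """Copy reads without any unwanted sequence to out_handle; return (kept, dropped)."""
    kept = dropped = 0
    for record in read_fastq(in_handle):
        if contains_any(record[1], unwanted):
            dropped += 1
        else:
            out_handle.write("\n".join(record) + "\n")
            kept += 1
    return kept, dropped


# Tiny in-memory demo using the overrepresented sequence quoted above.
UNWANTED = ["GGCTTACGGTGGATACCTAGGCACCCAGAGACGAGGAAGGGCGTAGTAAG"]
contaminated = "AAA" + UNWANTED[0] + "TT"
demo_input = io.StringIO(
    "@read1\n" + contaminated + "\n+\n" + "I" * len(contaminated) + "\n"
    "@read2\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
)
demo_output = io.StringIO()
kept, dropped = filter_fastq(demo_input, demo_output, UNWANTED)
print(kept, dropped)  # prints "1 1"
```

With real data the two `io.StringIO` objects would be replaced by open file handles, and all 8 sequences would go into the list. This sketch only catches exact forward-strand matches, and for paired-end data both mates would have to be dropped together, which it does not handle.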