Question: Can contamination cause alignment errors and how can I trim contaminants away?
gravatar for Mesohexaploid
11 weeks ago by
Mesohexaploid 10 wrote:

Hello Galaxy Team,

Sorry if this question is all over the place.

I have been trying to align some RNA seq reads (each around 6 million reads in size) against a diploid plant genome (32,928 sequences). However, I cannot seem to get it aligned. Both RNA STAR and TopHat2 give me different error messages. STAR claims this job was terminated because it used more memory than it was allocated, while TopHat 2 seems to have done an alignment only to then say AttributeError: 'module' object has no attribute 'ICM', but I'm not sure if those two errors are related.

The only tool that was able to generate an alignment was HISAT2, however I'd rather not use that due to its performance in benchmarking studies.

I suspect the errors in alignment may be caused by a number of overrepresented sequences due to chloroplast DNA contamination, e.g:

Percentage: 0.9102121596018445 Possible source:No Hit

I have around 8 of such overrepresented sequences per sample. Could this be the reason for the errors? If so, how do I get rid of them? I know Trim Galore! has the option of trimming a custom sequence, but how do I get rid of more than one? Would I have to hook a bunch of Trim Galore!s together as a workflow (one for each overrepresented sequence), or are there any other relevant tools out there?

ADD COMMENTlink modified 11 weeks ago by Jennifer Hillman Jackson24k • written 11 weeks ago by Mesohexaploid 10
gravatar for Jennifer Hillman Jackson
11 weeks ago by
United States
Jennifer Hillman Jackson24k wrote:


This sounds like a content or format issue with the inputs triggering a memory intensive job. Please see:

Tophat is considered a deprecated tool replaced by HISAT. RNASTAR is not recommended for most use-cases. It utilizes memory differently and can fail for a memory-related reason when HISAT will not, even when given the same inputs.

A few overrepresented sequences are unlikely to be the root cause of the problem but you could test out removing them to see if that makes a difference (Trim Galore is not really intended to be used this way, but some do use it anyway with the method you describe - e.g. one sequence at a time). How to perform NGS QA/QC is covered in this linked tutorial and analysis-specific QA is included in many other tutorials:

If after reviewing you cannot determine the source of the problem, and the work is at Galaxy Main ( or can be reproduced there, a bug report can be sent in for more feedback.

Thanks! Jen, Galaxy team

ADD COMMENTlink modified 11 weeks ago • written 11 weeks ago by Jennifer Hillman Jackson24k

Hello Jennifer,

Thanks for your swift response! I'll stick with HISAT2 for now. Unfortunately, the BAM file produced by HISAT2 only contained a list of QNAME but nothing else. I'm still trying to determine what is wrong with either the reference sequence or my reads to have caused this.

The reference sequence: -is normalized -is ordered into 9 chromosomes, with an extra segment (about the size of a chromosome -12,200 KB) containing short, unsorted reads

The RNAseq reads: -fastqsanger format, paired reads, pairs were groomed and then merged, once merged they were trimmed for quality -around 2.5 GB -FASTQC quality report- features an "X" for per tile quality, per base content, duplication levels, overrepresented sequences and kmer content.

Could it be the segment of unsorted sequences in the reference causing the complete lack of alignment, or is it some read quality aspect that is most likely the culprit? Would it help if I turn my normalized reference sequence into a custom build for the alignment? I guess I'll just keep messing around with my data until I something works...

ADD REPLYlink written 11 weeks ago by Mesohexaploid 10

Are you running the fastq data as two inputs (forward reads in one dataset, reverse reads in another)? Try that if not.

ADD REPLYlink written 11 weeks ago by Jennifer Hillman Jackson24k

Thanks again Jennifer,

So I interlaced my reads after grooming, then trimmed them for quality, then de-interlaced them and separately loaded the forward and reverse reads into HISAT2. I almost thought the output was empty again but then I scrolled really far down and, lo and behold, alignments had been detected! Not very many though, and they don't seem to appear when we open them in IGV.

In conclusion, it seems that Galaxy's tools are working and it's just something about the reference sequence (B. oleracea v2) or our RNAseq reads that is off. We'll try to run it again with a different reference as well as with a different set of RNAseq reads and see what happens. I'll also see what happens if I use the same workflow to align some Arabidopsis sequences...

Thanks for your help!

ADD REPLYlink written 11 weeks ago by Mesohexaploid 10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 101 users visited in the last hour