Problem running Clustal 2.1 Multiple Sequences Alignment

Question: Problem running Clustal 2.1 Multiple Sequences Alignment - an instance of 'std::bad_alloc'

3.5 years ago by

Canada

annadv77 • 20 wrote:

Dear All,

I've been trying to analyze reads from a short transcript. The data were generated on a MiSeq machine and are paired-end (two separate files). I am new to RNA-seq analysis, so I was advised to do the following:

- trim off the primers and any adaptor sequence
- assemble the two overlapping reads to get a consensus sequence for each fragment
- discard any low quality data that remains
- align the consensus sequences to your reference sequence

I performed the following steps on the public Galaxy server:

1. Removed the adapters and primers with the Clip tool.
2. Ran the FASTQ joiner tool to combine both files into one.
3. Filtered by quality (FASTQ filter by quality tool).
4. Converted FASTQ to FASTA with the FASTQ to FASTA tool.
5. Attempted to run Clustal 2.1 to perform a multiple sequence alignment.

After step 5, the output was empty and I got the following error message at the end of the log file:

CLUSTAL 2.1 Multiple Sequence Alignments

Sequence type explicitly set to DNA
Sequence format is Pearson
Sequence 1: 1             38 bp
Sequence 2: 2             38 bp
Sequence 3: 3             96 bp
Sequence 4: 4             69 bp
.........
Sequence 126812: 126812       180 bp
Sequence 126813: 126813       180 bp
Sequence 126814: 126814       180 bp
Sequence 126815: 126815       153 bp
Sequence 126816: 126816       180 bp
Sequence 126817: 126817        69 bp
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Start of Pairwise alignments
Aligning...
Could not allocate a distance matrix for 126817 seqs. Need to terminate program.

Could anybody please explain to me what the problem with my workflow is?

Thank you very much!

Anna

alignment • 2.4k views

modified 3.5 years ago by Jennifer Hillman Jackson ♦ 25k • written 3.5 years ago by annadv77 • 20

3.5 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

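This type of error generally indicates a memory problem: the job needed more memory than was available to run the tool. The log shows where it failed. Clustal could not allocate a pairwise distance matrix for 126817 sequences, and the size of that matrix grows with the square of the number of sequences.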
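To get a feel for the scale, here is a rough back-of-the-envelope sketch in Python. It assumes a full N x N matrix with 8 bytes per entry; that storage layout is an assumption for illustration, not a statement about how Clustal 2.1 actually stores distances, and the function name is hypothetical.

```python
def distance_matrix_footprint(n_seqs: int, bytes_per_entry: int = 8) -> dict:
    """Estimate the size of a full n x n pairwise distance matrix.

    bytes_per_entry=8 is an assumption (one double per entry); the real
    tool may use a different layout.
    """
    entries = n_seqs * n_seqs              # cells in a full square matrix
    pairs = n_seqs * (n_seqs - 1) // 2     # distinct pairwise alignments
    total_bytes = entries * bytes_per_entry
    return {
        "entries": entries,
        "pairs": pairs,
        "bytes": total_bytes,
        "gib": total_bytes / 2**30,
    }


if __name__ == "__main__":
    est = distance_matrix_footprint(126817)  # sequence count from the log
    print(f"pairwise alignments: {est['pairs']:,}")
    print(f"matrix size: about {est['gib']:.0f} GiB")
```

Under those assumptions the matrix alone comes to roughly 120 GiB, on top of about 8 billion pairwise alignments, which is why a multiple sequence alignment of every read is not practical here.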

There are other potential problems with the workflow. Clustal was expecting DNA input (not RNA). Also, joining paired ends is not the same as assembling them. So the actual workflow does not quite match the generalized analysis pathway you shared.
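The difference between joining and assembling a read pair can be sketched in a few lines of Python. This is a toy illustration only: it assumes a joiner simply concatenates the two reads, and the overlap merge requires an exact match with no quality handling, which real read-merging tools do not require. All function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_reads(read1: str, read2: str) -> str:
    """Joining (assumed behavior): concatenate the reads end to end."""
    return read1 + read2


def merge_reads(read1: str, read2: str, min_overlap: int = 10):
    """Assembling a pair: overlap read1 with the reverse complement of read2.

    Tries the longest exact overlap first; returns None if no overlap of
    at least min_overlap bases is found.
    """
    read2_rc = reverse_complement(read2)
    longest = min(len(read1), len(read2_rc))
    for k in range(longest, min_overlap - 1, -1):
        if read1[-k:] == read2_rc[:k]:
            return read1 + read2_rc[k:]
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACG"   # made-up 30 bp fragment
    r1 = fragment[:20]                            # forward read
    r2 = reverse_complement(fragment[10:])        # reverse read
    print(join_reads(r1, r2))                     # 40 bases, overlap duplicated
    print(merge_reads(r1, r2, min_overlap=5))     # reconstructs the fragment
```

The joined sequence is longer than the original fragment and contains the overlapping region twice (once reverse-complemented), while the merged sequence is a single consensus for the fragment.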

What is your goal? To assemble the reads into consensus sequences? Trinity is a good option.

Thanks, Jen, Galaxy team

ADD COMMENT • link written 3.5 years ago by Jennifer Hillman Jackson ♦ 25k

Hello Jen,

Thank you very much for your answer!

My final goal is to tally variants at specific locations of targeted mutations.

Following your suggestion, I am going to revise my workflow as follows:

1. I will remove the adapters and primers by using the Clip tool (same as previously).
2. I will use Trinity to perform de novo assembly.
3. I will use Trinity to perform quality assessment.
4. I will align the consensus sequences obtained from the previous step to the reference sequence (~250 bp).

Is this correct?
If so, may I ask what you would recommend for aligning the consensus sequences to the reference sequence? If I'm not mistaken, Trinity performs only de novo assembly?

What should be used for tallying variants at specific locations? I was planning to use the Motif Tools: Sequence Logo tool - would it be the right approach?

Thank you very much for all your help!

Regards,

Anna

ADD REPLY • link written 3.5 years ago by annadv77 • 20

Hi Anna,

If the data is from multiple samples, you might not want to assemble in batch, or at all, since this will mix the samples together, making a trace-back to a particular sample/condition difficult if not impossible.

If this is the case, using a "QC > map > variant calling > (optional) annotation" type of workflow would probably be a better choice.

Is the sequenced data really RNA and not DNA? I understand the target is a particular transcript (if I understood you correctly), but both are possible sequencing options. It depends on what the samples were based on. Meaning, was the library prep done by targeting a genomic region, or was it based on an RNA library specific to the transcript? One sample/condition or many? These details make a difference in the types of tools you can use for mapping/variant calls. Search the tool panel with the keywords RNA, DNA, and then VCF to get an idea of the types of tools available for each. Reviewing publications that performed the same analysis will also show you what others are doing, so you can make the best choices. There are several example protocols using Galaxy in the wiki, see the Learn area. I would also suggest reviewing the new NGS 101 wiki to better understand the options.

Sequence Logo can be great for graphics/visual data reduction - but the calls would be best made by using one of the variant calling tools/protocols, as far as I know. So use that tool last if wanted.
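As a rough picture of what "tallying variants at a specific location" means once reads are mapped, here is a minimal Python sketch. The input format (a 0-based start position plus an ungapped read sequence) is hypothetical and chosen for illustration; it ignores indels, base qualities, and strand, all of which a real variant caller handles.

```python
from collections import Counter


def tally_position(alignments, ref_pos: int) -> Counter:
    """Count the bases observed at one reference position.

    alignments: iterable of (start, sequence) tuples, where start is the
    0-based reference position of the first base of an ungapped read.
    This input format is an assumption made for illustration.
    """
    counts = Counter()
    for start, seq in alignments:
        offset = ref_pos - start
        if 0 <= offset < len(seq):   # read covers the position
            counts[seq[offset]] += 1
    return counts


if __name__ == "__main__":
    # Made-up reads, already placed on the reference.
    reads = [(0, "ACGTAC"), (2, "GTAC"), (1, "CGAAC"), (4, "AC")]
    print(tally_position(reads, ref_pos=3))
```

In practice the per-position counts would come from the mapping and variant calling tools in the workflow rather than from hand-written code.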

Jen

ADD REPLY • link modified 3.5 years ago • written 3.5 years ago by Jennifer Hillman Jackson ♦ 25k

Hello Jen,

Thank you very much for the answer and for the links! I've seen some of the information before, but I'm going to check it more thoroughly.

Thank you!

Regards,

Anna

ADD REPLY • link written 3.5 years ago by annadv77 • 20
