Joining overlapping paired end reads

3.5 years ago by

Canada

annadv77 • 20 wrote:

Dear All,

I have DNA sequencing data (targeted sequencing) produced on a MiSeq. The insert size is 250 bp; it was sequenced with paired-end reads, and the reads overlap. I am planning to use Velvet on Galaxy to perform de novo assembly. As I understand it, Velvet requires the input to be in one file, so the paired-end reads must be joined prior to the assembly.

I've been trying to use the fastq-join tool in Galaxy; however, it keeps giving error messages saying that the IDs in one file cannot be found in the other file.

What should I do to make the fastq-join tool run successfully? Does it need all the reads to be sorted in both files, and does it need only the reads that appear in both files?

If so, how could I perform these operations in Galaxy?

Thank you very much for the help!

Regards,

Anna

overlapping paired end reads galaxy fastq_join • 4.5k views

modified 3.5 years ago by Guy Reeves • 1.0k • written 3.5 years ago by annadv77 • 20

3.5 years ago by

Bjoern Gruening ♦ 5.1k

Germany

Bjoern Gruening ♦ 5.1k wrote:

Hi Anna,

If you can install tools into your Galaxy instance, you can use PEAR, available from the Galaxy Tool Shed.

https://toolshed.g2.bx.psu.edu/view/iuc/pear/

Cheers,

Bjoern

ADD COMMENT • link written 3.5 years ago by Bjoern Gruening ♦ 5.1k

Hi Bjoern,

Thank you very much for your suggestion! I will try PEAR for sure. I had read about it, but I couldn't find it in the Tool Shed for some reason.

Thank you for providing the link!

Regards,

Anna

ADD REPLY • link written 3.5 years ago by annadv77 • 20

Hi Bjoern,

Thank you for suggesting PEAR from the Tool Shed. I've installed it and it is working very well! I actually ended up not needing it for the project I intended to use it for, but it worked well on another set of data.

Thank you!

Regards,

Anna

ADD REPLY • link written 3.5 years ago by annadv77 • 20

Anna, thanks for letting me know it's working! I really appreciate any feedback!

ADD REPLY • link written 3.5 years ago by Bjoern Gruening ♦ 5.1k

3.5 years ago by

Guy Reeves • 1.0k

Germany

Guy Reeves • 1.0k wrote:

Hi Anna

Ideally your paired-end reads should not be joined (or merged), particularly if you plan to exploit the benefit of paired-end sequencing to make a de novo assembly, though if your insert size is only 250 bp there may be only limited benefit. I think what you want for Velvet is a file where 'each read is paired with the one directly above or the one directly below'. This file is a merge of the two FASTQ files, but the reads themselves are not merged. I think the following tools might help.

This tool will merge the FASTQ files (but they will not be sorted!)

NGS: Picard (beta)>FASTQ to BAM

This tool will sort the BAM file by read name.

NGS: SAMtools> Sort BAM dataset sort by: read names.

If you want to check that everything worked, you can convert the BAM files into SAM files at any stage to make it easier to see what has happened. I believe either of these formats is accepted by Velvet.
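For anyone working outside Galaxy, the 'each read next to its mate' layout can be sketched in plain Python. This is only an illustration of the idea, not what the Picard or SAMtools tools do internally. The function names (`read_fastq`, `read_id`, `interleave`) are made up for this example, and it assumes four-line FASTQ records whose mates share a name up to an optional `/1` or `/2` suffix or a space-separated comment. Reads whose mate is missing from the other file are dropped, which also covers the 'IDs cannot be found in the other file' situation.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of lines."""
    # Blank lines are ignored; every record is assumed to span exactly four lines.
    it = (line.rstrip("\n") for line in lines if line.strip())
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError(f"malformed FASTQ record: {record!r}")
        yield tuple(record)


def read_id(header):
    """Reduce a header to a pair-level ID (assumed naming convention)."""
    fields = header[1:].split()
    name = fields[0] if fields else ""
    # Drop the comment after the first space and any /1 or /2 mate suffix.
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def interleave(forward_lines, reverse_lines):
    """Return records interleaved as forward, reverse, forward, reverse, ...

    Only pairs present in both inputs are kept, in forward-file order.
    The reverse file is held in memory, so this suits modest data sets only.
    """
    reverse_by_id = {read_id(rec[0]): rec for rec in read_fastq(reverse_lines)}
    out = []
    for rec in read_fastq(forward_lines):
        mate = reverse_by_id.get(read_id(rec[0]))
        if mate is not None:  # skip reads whose mate is missing
            out.extend([rec, mate])
    return out
```

Passing two open file handles to `interleave` and writing each returned record back out as four lines gives a single interleaved FASTQ file. The reverse file does not need to be sorted, because mates are looked up by ID.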

Thanks Guy

ADD COMMENT • link written 3.5 years ago by Guy Reeves • 1.0k

Hi Guy,

Thank you for your detailed reply!
I am wondering whether the NGS: Picard (beta) > FASTQ to BAM tool will be able to deal with overlapping paired ends? I ask because probably most of the pairs overlap, and by relatively long stretches (I think).

Thank you!

Regards,

Anna

ADD REPLY • link written 3.5 years ago by annadv77 • 20

Hi Anna

To be honest, I cannot see any circumstances where it is desirable to merge overlapping reads for genome assembly (unless your read length is really, really short, <30 bp).

It is just an inefficient thing to do. It leaves you with almost no capacity to have contigs span repetitive regions longer than your merged read length, it has the potential to exacerbate problems with PCR error in low-coverage regions, and you end up doubling the amount of sequencing you need to do if the overlap is complete. I think these concerns are reflected in the Velvet advice: "the use of paired-end reads is strongly recommended to obtain longer contigs, especially in repetitive regions."

Others may have different opinions on this, but it is worth thinking about why paired-read strategies are generally employed for assembly rather than single reads (which is effectively what you have if you merge reads). In general, the insert size of most of your sequencing libraries should be more than twice the read length you are using; this removes the overlap problem and maximises the power of paired-end sequencing. If this is not the case for your libraries, then you may have limited capacity to deal with repetitive regions and you will have to do twice as much sequencing. Thanks Guy
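To put numbers on the 'insert size more than twice the read length' rule, here is a small sketch. It assumes 'insert size' means the full fragment length between the adapters, as in Anna's post. The read lengths tried below are hypothetical, since the thread does not say which MiSeq read length was used, and `expected_overlap` is just a name chosen for this example.

```python
def expected_overlap(read_length, insert_size):
    """Number of fragment bases covered by both mates; 0 means no overlap."""
    # A read cannot cover more of the fragment than the fragment's own length.
    covered = min(read_length, insert_size)
    return max(0, 2 * covered - insert_size)


# Hypothetical read lengths against the 250 bp insert from the question.
for read_length in (100, 150, 250):
    print(read_length, expected_overlap(read_length, 250))
```

With a 250 bp insert, 100 bp reads leave a gap (overlap 0), 150 bp reads overlap by 50 bp, and 250 bp reads overlap completely, which is the 'twice as much sequencing' case.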

ADD REPLY • link modified 3.5 years ago • written 3.5 years ago by Guy Reeves • 1.0k

Hi Guy,

Thank you very much for your detailed reply.

I agree with your points, especially about having reads that are shorter than half the length of the insert. Unfortunately, I was not involved in the design of the experiment, and I was asked to work with the data when it had already been obtained.

Regarding merging the overlapping reads, thank you for the advice. I was trying to do that because I had read on a bioinformatics blog that merging overlapping reads and then performing de novo assembly with Velvet gives better results than assembling without the merge. At the time I wondered whether that was logical, and your explanation convinced me that it would be preferable not to merge before performing the de novo assembly.

Regards,

Anna

ADD REPLY • link written 3.5 years ago by annadv77 • 20

Hi Anna

I do not see why the FASTQ to BAM tool would have a problem; it just merges the FASTQ files and makes a BAM file. It is not doing anything clever. But the best way to check is probably to view the SAM files. Thanks Guy

ADD REPLY • link written 3.5 years ago by Guy Reeves • 1.0k
