Question: Joining overlapping paired end reads
annadv77 (Canada) wrote, 3.5 years ago:

Dear All,

 

I have DNA sequencing data (targeted sequencing) produced by MiSeq. The size of the insert is 250bp, it was sequenced by using paired end reads and the reads are overlapping. I am planning to use velvet on Galaxy to perform de novo assembly. As I understand, velvet requires the input to be in one file, so the paired end reads must be joined prior to the assembly.

I've been trying to use fastq-join tool in Galaxy; however, it keeps giving error messages, saying that the ids in one file cannot be found in the other file.

What should I do to make the fastq-join tool run successfully? Does it need the reads to be sorted in the same order in both files, and does it need only the reads that appear in both files?

If so, how could I perform these operations in Galaxy?

 

Thank you very much for the help!

 

Regards,

Anna
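
A quick way to see whether the two files really disagree is to compare their read IDs outside Galaxy. The sketch below is only an illustration, not part of fastq-join or Galaxy. It assumes plain four-line FASTQ records and that mates share a name apart from an optional /1 or /2 suffix or a comment after the first space. The function names read_ids and compare_pairs are made up for this example.

```python
import io


def read_ids(handle):
    """Return the read IDs of a four-line-per-record FASTQ stream, in file order."""
    ids = []
    for line_number, line in enumerate(handle):
        if line_number % 4 != 0:
            continue  # only the first line of each record carries the ID
        if not line.startswith("@"):
            raise ValueError(f"line {line_number + 1}: expected an '@' header")
        fields = line[1:].split()
        name = fields[0] if fields else ""
        # Assumption: mates differ only by a trailing /1 or /2.
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        ids.append(name)
    return ids


def compare_pairs(forward, reverse):
    """Report IDs missing from either file and whether the order matches."""
    fwd, rev = read_ids(forward), read_ids(reverse)
    return {
        "only_forward": sorted(set(fwd) - set(rev)),
        "only_reverse": sorted(set(rev) - set(fwd)),
        "same_order": fwd == rev,
    }


# Tiny in-memory example; real use would pass open file handles instead.
example_r1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nACGT\n+\nIIII\n")
example_r2 = io.StringIO("@read1/2\nACGT\n+\nIIII\n@read2/2\nACGT\n+\nIIII\n")
print(compare_pairs(example_r1, example_r2))
```

If only_forward or only_reverse is non-empty, or same_order is False, the files would need to be filtered to the shared reads and put in the same order before a pairing tool is likely to accept them.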

Bjoern Gruening (Germany) wrote, 3.5 years ago:

Hi Anna,

if you can install tools into your Galaxy instance you can use PEAR, available from the Galaxy Tool Shed.

https://toolshed.g2.bx.psu.edu/view/iuc/pear/

Cheers,

Bjoern


Hi Bjoern,

 

Thank you very much for your suggestion! I will try PEAR for sure - I read about it, I just couldn't find it in the Tool shed for some reason.

Thank you for providing the link!

 

Regards,

Anna

Reply written 3.5 years ago by annadv77

Hi Bjoern,

Thank you for suggesting PEAR from the Tool Shed. I've installed it and it is working very well! I actually ended up not needing it for the project I intended to use it for, but it worked well on another set of data.

Thank you!

 

Regards,

Anna

Reply written 3.5 years ago by annadv77

Anna, thanks for letting me know it's working! I really appreciate any feedback!
 

Reply written 3.5 years ago by Bjoern Gruening
Guy Reeves (Germany) wrote, 3.5 years ago:

Hi Anna

Ideally, your paired-end reads should not be joined (or merged), particularly if you plan to exploit the benefit of paired-end sequencing to make a de novo assembly. Though if your insert size is only 250 bp, there may be only limited benefit. I think what you want for Velvet is a file where 'each read is paired with the one directly above or the one directly below'. This file is a merge of the two FASTQ files, but the reads themselves are not merged. I think the following tools might help.

This tool will merge the FASTQ files (but they will not be sorted!):

NGS: Picard (beta) > FASTQ to BAM

This tool will sort the BAM file by read name:

NGS: SAMtools > Sort BAM dataset, sort by: read names

If you want to check that everything worked out, you can convert the BAM files into SAM files at any stage to make it easier to visualise what has gone on. I believe either of these formats is accepted by Velvet.

Thanks Guy
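
For readers working outside Galaxy, the "each read next to its mate" layout quoted above can also be produced directly from the two FASTQ files. This is a minimal sketch, not what the Picard tool does internally. It assumes four-line FASTQ records, that both files list the reads in the same order, and that mate names differ only by a /1 or /2 suffix or a comment after the first space. The names fastq_records, base_id and interleave are hypothetical.

```python
import io
from itertools import islice, zip_longest


def fastq_records(handle):
    """Yield four-line FASTQ records as lists of lines without newlines."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError("truncated or malformed FASTQ record")
        yield record


def base_id(header):
    """Read name without the leading '@', comment, or /1 and /2 suffix."""
    fields = header[1:].split()
    name = fields[0] if fields else ""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def interleave(forward, reverse, out):
    """Write mate 1 then mate 2 for every pair; return the number of pairs."""
    pairs = 0
    for fwd, rev in zip_longest(fastq_records(forward), fastq_records(reverse)):
        if fwd is None or rev is None:
            raise ValueError("the two files contain different numbers of reads")
        if base_id(fwd[0]) != base_id(rev[0]):
            raise ValueError(f"mate mismatch: {fwd[0]} vs {rev[0]}")
        out.write("\n".join(fwd + rev) + "\n")
        pairs += 1
    return pairs


# In-memory demonstration; real use would pass open file handles.
demo_fwd = io.StringIO("@r1/1\nAAAA\n+\nIIII\n@r2/1\nCCCC\n+\nIIII\n")
demo_rev = io.StringIO("@r1/2\nTTTT\n+\nIIII\n@r2/2\nGGGG\n+\nIIII\n")
demo_out = io.StringIO()
print(interleave(demo_fwd, demo_rev, demo_out), "pairs written")
```

The sketch stops with an error instead of silently writing mismatched mates, since an assembler that trusts the pairing would otherwise be given wrong pairs.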

 


Hi Guy,

Thank you for your detailed reply!
I am wondering whether the NGS: Picard (beta) > FASTQ to BAM tool will be able to deal with overlapping paired ends? I ask because probably most of my reads overlap, and (I think) by relatively long stretches.

Thank you!

 

Regards,

Anna

Reply written 3.5 years ago by annadv77

Hi Anna

To be honest, I cannot see any circumstances where it is desirable to merge overlapping reads for genome assembly (unless your read length is really, really short, <30 bp).

It is just an inefficient thing to do. It leaves you with almost no capacity to have contigs span repetitive regions longer than your merged read length, it has the potential to exacerbate problems with PCR error in low-coverage regions, and you end up doubling the amount of sequencing you need to do if the overlap is complete. I think these concerns are reflected in the Velvet advice: "the use of paired-end reads is strongly recommended to obtain longer contigs, especially in repetitive regions."

Others may have different opinions on this, but it is worth thinking about why paired-read strategies are generally employed for assembly rather than single reads (which is effectively what you are using if you merge reads). In general, the insert size of most of your sequencing libraries should be more than twice the read length you are using; this removes the overlap problem and maximises the power of paired-end sequencing. If this is not the case for your libraries, then you may have limited capacity to deal with repetitive regions and you will have to do twice as much sequencing.

Thanks, Guy

Reply written 3.5 years ago (modified 3.5 years ago) by Guy Reeves
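
The rule of thumb about insert size versus read length is simple arithmetic, sketched below. The thread gives the insert size (250 bp) but not the read length, so the read lengths in the example are purely hypothetical. The sketch also assumes that "insert size" means the length of the sequenced fragment without adapters, and mate_overlap is a made-up helper name.

```python
def mate_overlap(insert_size, read_length):
    """Number of bases covered by both mates of a pair.

    Assumption: insert_size is the fragment length without adapters and
    both mates have the same read_length.
    """
    if insert_size <= 0 or read_length <= 0:
        raise ValueError("insert_size and read_length must be positive")
    # No overlap once the insert is at least twice the read length; the
    # overlap can never exceed the fragment itself.
    return max(0, min(insert_size, 2 * read_length - insert_size))


# Hypothetical read lengths for the 250 bp insert mentioned in the question.
for read_length in (100, 125, 150, 250):
    print(read_length, mate_overlap(250, read_length))
```

Under these assumptions, a 250 bp insert shows no overlap for reads of 125 bp or shorter, 50 bp of overlap for 150 bp reads, and complete overlap for 250 bp reads, the case in which every base is sequenced twice.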

Hi Guy,

Thank you very much for your detailed reply.

I agree with your points. Especially about having reads that are shorter than half the length of the insert. Unfortunately, I was not involved in the design of the experiment, and I was asked to work with the data, when it was already obtained.

Regarding merging the overlapping reads - thank you for the advice. I was trying to do that, since I read on one of the bioinformatics blog sites, that merging overlapping reads and then performing de novo assembly with velvet provides better results than performing the assembly without the merge. At that time I was wondering whether that was logical, but your explanation convinced me that it would be preferable not to merge prior to performing the de novo assembly.

Regards,

Anna

Reply written 3.5 years ago by annadv77

Hi Anna

I do not see why the FASTQ to BAM tool would have a problem; it just merges the FASTQ files and makes a BAM file. It is not doing anything clever. But the best way to check is probably to view the SAM files.

Thanks, Guy

 

Reply written 3.5 years ago by Guy Reeves
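
Checking a SAM file by eye works for a handful of reads; for a whole file, the same check can be scripted. The sketch below assumes a text SAM file that has already been sorted by read name, and relies only on the first two SAM columns (read name and FLAG) and on the standard FLAG bits 0x40 (first in pair) and 0x80 (second in pair). The function name check_mates_adjacent is hypothetical.

```python
import io

FIRST_IN_PAIR = 0x40   # SAM FLAG bit: first segment of the pair
SECOND_IN_PAIR = 0x80  # SAM FLAG bit: last segment of the pair


def check_mates_adjacent(sam_handle):
    """Return a list of problems; an empty list means mates sit side by side."""
    records = []
    for line in sam_handle:
        if line.startswith("@") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        records.append((fields[0], int(fields[1])))

    problems = []
    if len(records) % 2:
        problems.append("odd number of records")
    for i in range(0, len(records) - 1, 2):
        (name_a, flag_a), (name_b, flag_b) = records[i], records[i + 1]
        mates = {flag_a & 0xC0, flag_b & 0xC0}
        if name_a != name_b:
            problems.append(f"records {i + 1} and {i + 2}: names differ")
        elif mates != {FIRST_IN_PAIR, SECOND_IN_PAIR}:
            problems.append(f"records {i + 1} and {i + 2}: not a first/second pair")
    return problems


# In-memory example with one unaligned pair; real use would open a SAM file.
demo_sam = io.StringIO(
    "@HD\tVN:1.5\tSO:queryname\n"
    "r1\t77\t*\t0\t0\t*\t*\t0\t0\tAAAA\tIIII\n"
    "r1\t141\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n"
)
print(check_mates_adjacent(demo_sam))
```

An empty list suggests the merge and sort steps left each read directly next to its mate; any entries point at the record positions worth inspecting by hand.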