Question: Preprocessing of mate paired end read sequences for de novo genome assembly
4.4 years ago · United States
Marnie Plunkett wrote:

I have just started out in the field of genome assembly, so please bear with my lack of knowledge about the subject. I have been given the task of assembling de novo the genome of a diatom using both mate pair and paired-end reads. I think I have a workflow figured out for the paired-end reads using Velvet, as explained below:

  1. Read QC
  2. Trimmer
  3. Read QC
  4. Cutadapt
  5. Read QC
  6. Resync 
  7. velveth (hash = 35)
  8. velvetg 
  9. assembly stats 
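Step 6 (resync) is needed because trimming each mate file independently drops different reads from each side, so the two FASTQ files must be re-paired before the assembler sees them. A minimal in-memory Python sketch of that step (the `/1`/`/2` read-ID convention and the file layout are assumptions; adjust for your data):

```python
# Re-pair two FASTQ files after independent trimming: keep only reads
# whose mate survived in the other file. Assumes 4-line FASTQ records
# and read IDs ending in /1 and /2 (an assumption; adjust as needed).

def load_fastq(path):
    """Return {read_key: full_4_line_record} for a FASTQ file."""
    records = {}
    with open(path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            rest = fh.readline() + fh.readline() + fh.readline()
            key = header.split()[0]
            if key.endswith(("/1", "/2")):
                key = key[:-2]          # mates share the same ID stem
            records[key] = header + rest
    return records

def resync(fq1_in, fq2_in, fq1_out, fq2_out):
    """Write out only the read pairs present in both input files."""
    r1, r2 = load_fastq(fq1_in), load_fastq(fq2_in)
    shared = [k for k in r1 if k in r2]  # preserves file-1 order
    with open(fq1_out, "w") as o1, open(fq2_out, "w") as o2:
        for k in shared:
            o1.write(r1[k])
            o2.write(r2[k])
    return len(shared)
```

Loading both files into dictionaries is fine for a sketch but memory-hungry at >160M reads; dedicated resync tools stream through sorted files instead.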

However, now I want to improve the quality of the assembly using the mate pair reads, and I don't know how I should prep that data for assembly or what steps I should take after that. Some basic stats on the reads:

  1. Two mate pair libraries were prepared: one size-selected for 3-5 kb inserts and the other for 5-10 kb
  2. The project was sequenced as 100 bp paired-end reads
  3. The lane generated >160M reads
  4. Average quality scores are 37
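As a quick sanity check on those numbers: diatom genomes sequenced so far are in the tens of megabases, so a lane like this should give very deep coverage. The genome size below is an assumed placeholder, not a figure from the question:

```python
# Back-of-envelope base-coverage estimate from the stats above.
reads = 160_000_000        # ">160M reads for the lane"
read_len = 100             # 100 bp reads
genome_size = 30_000_000   # ASSUMED ~30 Mb; the actual size is unknown

coverage = reads * read_len / genome_size
print(f"~{coverage:.0f}x base coverage")
```

Depth this high is more than most assemblers need, so for the mate pair libraries the insert-size distribution will matter more to the assembly than raw read count.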

I am not sure what other data about the reads I can include, but any help would be very appreciated! I have found information on the topic that deals with RNA-seq data, but nothing so far on genomic data.

Thanks in advance,


4.4 years ago · United States
Jennifer Hillman Jackson wrote:

Hi Marnie,

This publication from GigaScience has a great deal of comparative assembler data (and includes the exact methods used to generate each dataset). It is about a year old and doesn't cover all the choices available now (or then), but it has very good advice all around, at both the detail and summary level:

It is important to note that the above was based on vertebrate assemblies, and one of the key findings was that different assembly methods rank differently even within that group; the conclusion was that assembler performance depends heavily on the particular genome being assembled. Still, it should give you some leads to follow. The two diatom genome assemblies completed to date were performed by JGI using proprietary software, so some testing with open-source tools will be necessary.

The Tool Shed offers several assembler choices that accept mixed-source data inputs. It is probably easier for you to browse and review the tools (and click through to the underlying binary tool's documentation) than for me to list them out. Look under the group "Assembly". These tools are meant to be run on a CloudMan Galaxy (or a local production Galaxy with sufficient resources). Perhaps try two and compare, if you have the resources.

For data prep, basic QA/QC for DNA is in many ways the same as for RNA. Clip the poorest-quality ends as needed, but conservatively: you don't want to lose data over a few bad base calls, especially in regions the assembler can still align and resolve correctly in the consensus through coverage (bridging gaps will be important in contig building). Also remove any artifact sequence, which will almost always degrade assembly quality and contribute to fragmentation. A few trial runs should reveal the optimal quality-clipping setting to use.
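The conservative end-clipping described above amounts to something like the following 3'-end quality trim (the Phred+33 encoding and the threshold of 20 are illustrative assumptions, not recommendations for this dataset):

```python
def qual_trim_3prime(seq, qual, threshold=20, offset=33):
    """Trim low-quality bases from the 3' end of a read.

    Walks in from the 3' end and cuts after the last position whose
    Phred quality (Sanger/Illumina 1.8+, ASCII offset 33) is at or
    above `threshold`. threshold=20 is an illustrative default.
    """
    cut = 0
    for i in range(len(qual) - 1, -1, -1):
        if ord(qual[i]) - offset >= threshold:
            cut = i + 1
            break
    return seq[:cut], qual[:cut]

# Example: quality-2 bases ("#") at the 3' end are clipped
# qual_trim_3prime("ACGTAA", "IIII##") -> ("ACGT", "IIII")
```

In practice the Galaxy trimming tools (or cutadapt's quality-trim options) do this for you; the point of the sketch is just that a conservative threshold leaves interior bases alone and only shaves the degraded tail.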

Good luck with your project,

Jen, Galaxy team


Marnie Plunkett replied:

Thank you! This material is very useful.


Powered by Biostar version 16.09