Preprocessing of mate paired end read sequences for de novo genome assembly

I have just started out in the field of genome assembly, so please bear with my lack of knowledge about the subject. I have been given the task of assembling de novo the genome of a diatom from both mate-pair and paired-end reads. I think I have a workflow figured out for the paired-end reads using Velvet, as outlined below:

Read QC
Trimmer
Read QC
Cutadapt
Read QC
Resync
velveth (hash = 35)
velvetg
assembly stats
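The Resync step above re-pairs the two read files after trimming, since a trimmer can drop one mate of a pair while assemblers generally expect mates in matching order. A minimal Python sketch of the idea follows. It is not the tool used in the workflow above; the function names and the read-name conventions it handles (a `/1` or `/2` suffix, or a comment after whitespace) are assumptions for illustration.

```python
def pair_key(read_name):
    """Strip the mate suffix so both mates of a pair share one key.

    ASSUMPTION: names look like 'r7/1' or 'r7 1:N:0:ACGT' (hypothetical examples).
    """
    name = read_name.split()[0]  # drop any comment after whitespace
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def resync(reads_1, reads_2):
    """Split trimmed reads into synchronized pairs and orphaned singles.

    reads_1, reads_2: lists of (name, sequence, quality) tuples.
    Returns (paired_1, paired_2, singles); pairs follow the order of reads_1.
    """
    by_key_2 = {pair_key(r[0]): r for r in reads_2}
    paired_1, paired_2, singles = [], [], []
    matched = set()
    for read in reads_1:
        key = pair_key(read[0])
        mate = by_key_2.get(key)
        if mate is None:
            singles.append(read)      # mate was lost during trimming
        else:
            paired_1.append(read)
            paired_2.append(mate)
            matched.add(key)
    # Reads from file 2 whose mate is missing from file 1 are also orphans.
    singles.extend(r for r in reads_2 if pair_key(r[0]) not in matched)
    return paired_1, paired_2, singles


# Hypothetical toy data: r2 lost its second mate, r4 lost its first mate.
forward = [("r1/1", "ACGT", "IIII"), ("r2/1", "GGCC", "IIII"), ("r3/1", "TTAA", "IIII")]
reverse = [("r1/2", "TGCA", "IIII"), ("r3/2", "AATT", "IIII"), ("r4/2", "CCGG", "IIII")]

p1, p2, orphans = resync(forward, reverse)
print(len(p1), "pairs kept;", len(orphans), "orphaned reads")
```

The orphaned reads are kept separately rather than discarded, so they can still be supplied to an assembler as unpaired reads if desired.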

However, I now want to improve the quality of the assembly using the mate-pair sequence data, and I don't know how I should prep the data for assembly or what steps I should take after that. Some basic stats on the reads are:

Two mate-pair libraries were prepared: one selected for 3-5 kb inserts and the other for 5-10 kb
The project was sequenced as 100 bp PE
Generated >160M reads for the lane
Average quality scores are 37
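For a rough sense of scale, the figures above translate into sequencing depth as reads x read length / genome size. The Python sketch below works that out. The genome size of 30 Mb is a placeholder assumption, not a value from this post, and 160M is used as a lower bound since the lane produced more than that.

```python
def fold_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average sequence depth: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


N_READS = 160_000_000           # ">160M reads" from the post, used as a lower bound
READ_LENGTH_BP = 100            # 100 bp reads, from the post
ASSUMED_GENOME_BP = 30_000_000  # ASSUMPTION: placeholder genome size, not from the post

total_bases = N_READS * READ_LENGTH_BP
depth = fold_coverage(N_READS, READ_LENGTH_BP, ASSUMED_GENOME_BP)

print(f"total bases: {total_bases / 1e9:.1f} Gb")
print(f"approximate depth at the assumed genome size: {depth:.0f}x")
```

The total of about 16 Gb does not depend on the assumption; the depth figure does, so swap in a real genome size estimate before reading anything into it.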

I am not sure what other data about the reads I can include, but any help would be greatly appreciated! I have found information on this topic that deals with RNA-seq data, but nothing so far on genomic data.

Thanks in advance,

Marnie

modified 4.4 years ago by Jennifer Hillman Jackson ♦ 25k • written 4.4 years ago by Marnie Plunkett • 0

Jennifer Hillman Jackson ♦ 25k wrote:

Hi Marnie,

This publication from GigaScience has a great deal of comparative assembler data (and includes the exact methods used to generate each dataset). It is about a year old and doesn't cover all the choices available now (or then), but it has very good advice at both the detail and summary level: http://www.gigasciencejournal.com/content/2/1/10

It is important to note that the above was based on vertebrate assemblies, and one of the key findings was that different assembly methods rank differently even among that group. The conclusion was that assembler performance depends heavily on the particular genome being assembled. However, this should give you some leads to follow. The two diatom genome assemblies completed to date were performed by JGI using proprietary software, so some testing with open source tools will be necessary.

The Tool Shed offers several assembler choices that accept mixed-source data inputs. It is probably easier for you to browse and review the tools (and click through to the underlying binary tool's documentation) than for me to list them out. Look under the group "Assembly". These tools are meant to be run on a CloudMan Galaxy (or a local production Galaxy with sufficient resources). Perhaps try two and compare, if you have the resources.
http://toolshed.g2.bx.psu.edu/
http://usegalaxy.org/toolshed
http://usegalaxy.org/cloud

For data prep, basic QA/QC for DNA is in many ways the same as for RNA. Clip the poorest-quality ends as needed, but conservatively: you don't want to lose data over a few bad base calls, especially in regions that the assembler can still align and resolve correctly in the consensus through coverage, since bridging gaps will be important in contig building. Also remove any artifacts, since leftover artifact sequence will almost always degrade assembly quality and contribute to fragmentation. A few runs should inform you of the optimal quality clipping setting to use.
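As one way to picture conservative end clipping, here is a minimal Python sketch that trims only the run of low-quality bases at the 3' end and reports how much sequence survives at a few thresholds. It assumes Phred+33 quality encoding; the function name, the thresholds, and the example read are hypothetical, and this is not the algorithm of any particular trimming tool.

```python
def clip_3prime(seq, qual, min_q, offset=33):
    """Remove trailing bases whose Phred score is below min_q.

    Only the 3' end is clipped; low-quality bases in the interior of the
    read are left alone, which keeps the trimming conservative.
    ASSUMPTION: qualities are Phred+33 encoded (offset=33).
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read. In Phred+33: 'I' = 40, '5' = 20, '+' = 10, '#' = 2.
read_seq = "ACGTACGTACGT"
read_qual = "IIIIIIII5+##"

# Try several thresholds to see how much sequence each setting costs.
for min_q in (5, 15, 25):
    kept_seq, kept_qual = clip_3prime(read_seq, read_qual, min_q)
    print(f"min_q={min_q}: kept {len(kept_seq)} of {len(read_seq)} bases")
```

Running the same comparison over a sample of real reads, and then over a few trial assemblies, is one way to settle on a clipping threshold.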

Good luck with your project,

Jen, Galaxy team

written 4.4 years ago by Jennifer Hillman Jackson ♦ 25k

Thank you! This material is very useful.

written 4.4 years ago by Marnie Plunkett • 0
