De novo transcriptome assembly and reference guided transcriptome assembly

4.5 years ago by

edoade_2014 • 30

European Union

edoade_2014 • 30 wrote:

Hi,

I have four related questions about de novo RNA-seq data analysis. I have RNA-seq data from 4 closely related insect species, with 3 biological replicates for each. I have the genome sequence (chromosome sequences) for only one of these species, but I have the gene set annotations (gtf) for all of them. I would like to do a de novo transcriptome assembly as well as a reference-guided transcriptome assembly.

For the reference-guided transcriptome assembly I was thinking of the following procedure: align the reads from all the species to the only reference genome I have, using Tophat, followed by cufflinks, cuffmerge and cuffdiff. My first question is: is it correct to use, in cufflinks, the species-specific gene set annotations (gtf) for each species, having previously used in Tophat the only genome sequence (1 species) I have? In other words, can I use in cufflinks the gene set annotations of each species together with the BAM files produced by Tophat against the chromosome sequence of another species?

Having 3 replicates for each tissue/species, my second question is: would you suggest merging the 3 replicates before mapping them with Tophat, as a means of increasing the number of reads used for mapping, or would you merge the replicates only afterwards in cuffmerge?

Third question: Would you use the same procedure as detailed above, but omitting the gene set annotations (gtf) in cufflinks, and consider the whole thing a de novo transcriptome assembly?

Fourth question: Any suggestions on which de novo assembler to use, among those available in the Galaxy ToolShed?

Sorry for merging all the questions into one post, but I need to figure out whether this procedure is correct.

Thank you very much for any advice you can give me.

Tan

assembly tophat cufflinks rna-seq • 5.2k views


modified 4.5 years ago by Jennifer Hillman Jackson ♦ 25k • written 4.5 years ago by edoade_2014 • 30

4.5 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hi Tan,

I'll break down your questions with answers.

RNA-seq with Tuxedo pipeline (Tophat, Cufflinks, etc.)

1. Q: Is it possible to input data aligned to reference genome "A" with reference annotation "B"?

No, both must be based on exactly the same reference genome, meaning the same chromosomes and coordinate system (nucleotide content is implied at some steps). A mismatch will trigger technical errors with the tools, as will even unintentional variations in chromosome/scaffold/etc. naming when the reference genome is in fact the same (one of the most common early errors users experience with this tool set), but the underlying reasons are biological. Rearrangements will occur at some rate between any two distinct species, and the purpose of RNA-seq analysis is to identify novel variants or characterize the differential expression between known and/or novel variants. The differences between transcript variants can be subtle. These differences will almost certainly be lost in the noise of other rearrangements if the analysis is performed using cross-species techniques at the genome-wide study level.
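The naming problem mentioned above can be checked before running anything. Here is a minimal sketch, assuming a plain-text genome FASTA and a GTF; the function names and the file names in the usage note are made up for illustration and are not part of Tophat, Cufflinks or Galaxy. It only compares sequence names, so it says nothing about whether the coordinates themselves agree.

```python
def fasta_seqnames(fasta_lines):
    """Collect sequence names (first word after '>') from FASTA lines."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_seqnames(gtf_lines):
    """Collect the values of column 1 (seqname) from GTF lines."""
    names = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        names.add(line.split("\t", 1)[0])
    return names


def compare_seqnames(fasta_lines, gtf_lines):
    """Report names that appear in only one of the two files."""
    genome = fasta_seqnames(fasta_lines)
    annotation = gtf_seqnames(gtf_lines)
    return {
        "only_in_gtf": sorted(annotation - genome),
        "only_in_genome": sorted(genome - annotation),
    }
```

With real files this could be called as `compare_seqnames(open("genome.fa"), open("genes.gtf"))` (hypothetical file names). Any name listed under `only_in_gtf` points to annotation lines that cannot be placed on the genome as named.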

But you can test for this yourself to get a bead on how divergent the species are. Align the sequences from the reference annotation set (the knowns) to the cross-species genome. Do all of them match perfectly, and if not, at what rate do they fail? Then keep in mind that you are only examining known genes (actually, transcripts) and that homology to other knowns is one of the most common methods for discovery and identification/proposed characterization (function, etc.) for newly sequenced data. There is a bias for conserved knowns. Novels/variants will be underrepresented, especially in new data, if this was the method used to create the annotation data. You can also test by aligning the same-species reads to the native genome and the related-species reads to the cross-species genome, then comparing mapping rates for a rough estimate.
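The mapping-rate comparison can be sketched roughly as follows. This assumes the alignments are available as SAM text (for example, a BAM converted to SAM) and that the total number of input reads is counted separately, since an aligner's output file may contain only the mapped reads. The function names are illustrative, not from any tool.

```python
def count_mapped_reads(sam_lines):
    """Count distinct reads with at least one alignment in SAM text.

    A read that aligns to several places is counted once. Mates of a
    pair are counted separately, using the first/second-in-pair flags.
    """
    mapped = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        if flag & 0x4:
            continue  # 0x4 = read unmapped
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        mapped.add((fields[0], mate))
    return len(mapped)


def mapping_rate(mapped_reads, total_reads):
    """Fraction of input reads that aligned at least once."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads
```

Computing this once for the native species and once for each related species against the same genome gives the rough estimate described above. A large drop for the cross-species data is a warning sign before any downstream step.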

Please understand - cross-species information is very valuable, but it is probably not best for this tool set. Depending on how far you want to take this, you could create a GTF from the cross-species annotation transcripts, aligned to the genome you have, perform some curation on the results, and use that with the pipeline (BLAT would be a good aligner for this). Both would then be technically based on the same genomic backbone and pass through the tool. But I'd be very cautious with this approach and aware of the factors when interpreting the results. Much depends on how similar the genomes and transcriptomes for the species involved really are, and curating gene/transcript data is tedious.
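As a rough illustration of the "create a GTF from aligned transcripts" idea, here is a minimal sketch that turns BLAT PSL output into exon lines. It assumes nucleotide-to-nucleotide alignments in the standard 21-column PSL layout; the function name and the `source` label are made up. It does no curation at all: every alignment block becomes an exon, and a transcript that aligns to several places gets the same `transcript_id` each time, so the result would still need filtering by hand.

```python
def psl_to_gtf(psl_lines, source="blat"):
    """Convert PSL alignment lines to GTF exon lines (one per block)."""
    gtf = []
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 21 or not fields[0].isdigit():
            continue  # skip PSL header lines and anything malformed
        strand = fields[8]
        if strand not in ("+", "-"):
            continue  # two-character strands (translated searches) not handled
        q_name, t_name = fields[9], fields[13]
        block_sizes = [int(x) for x in fields[18].split(",") if x]
        t_starts = [int(x) for x in fields[20].split(",") if x]
        attrs = f'gene_id "{q_name}"; transcript_id "{q_name}";'
        for size, start in zip(block_sizes, t_starts):
            # PSL starts are 0-based; GTF is 1-based with inclusive ends
            gtf.append("\t".join([
                t_name, source, "exon",
                str(start + 1), str(start + size),
                ".", strand, ".", attrs,
            ]))
    return gtf
```

Small gaps inside an alignment show up here as tiny introns, which is one of the things the curation step would have to clean up.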

2. Q: Replicates merged before or after mapping?

Run replicates through individually. There is value in this approach. Also remember that you can use Tophat to align to transcriptomes as well as genomes. More about this is in the manual for the tool and there is discussion at the tophat.cufflinks@gmail.com Google group. I'd recommend aligning to the transcriptome as a first pass, if the annotation is reasonably complete.

3. Q: Map cross-species and omit native reference annotation?

You could certainly test this out. Exploring is good. I also suggested doing the mapping portion of this and comparing to the mapping rate of native RNA-seq data, to better understand what the results are. If the data doesn't map well, this is probably not even worth considering for further analysis (downstream tools). I'd start by mapping the reference annotation cross-species - if that doesn't map well, you have the answer without doing the rest (plus it will be easier to interpret).

4. Q: Best assembler for insect?

I can't help here, I've worked primarily with mammal and plant genomes regarding de novo assembly. Let's see if someone else on the board can answer. I'd also recommend researching what was used for the genome you have, examining recent literature in your field, and going to the various tool sites' home pages, since they will often note how the parameters were tuned and the intended scope of use (I realize this may sound pedantic - but is not intended that way - and is a good double check even if/when you get advice). Then you have a truth test of sorts - the reference annotation. Try a few assembly methods. Align the annotation to the assemblies. Depending on the quality of the annotation (expect some variation - this will always be true first pass - sometimes in surprising ways), you should be able to make an informed decision.
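One way to put a number on the "align the annotation to the assemblies" comparison is sketched below. It assumes the annotated transcripts were used as queries against each assembly and the hits were saved in the common 12-column BLAST-style tabular format; the aligner, the function name and both thresholds are assumptions for illustration, not recommendations. Coverage is judged from single hits only, so a transcript split across two contigs counts as not recovered.

```python
def recovered_fraction(hit_lines, transcript_lengths,
                       min_identity=95.0, min_coverage=0.8):
    """Fraction of annotated transcripts with a good hit in an assembly.

    hit_lines: tabular lines (qseqid, sseqid, pident, length, mismatch,
               gapopen, qstart, qend, sstart, send, evalue, bitscore).
    transcript_lengths: dict of transcript id -> length in bases.
    """
    if not transcript_lengths:
        raise ValueError("transcript_lengths is empty")
    recovered = set()
    for line in hit_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        if query not in transcript_lengths:
            continue
        identity = float(fields[2])
        q_start, q_end = int(fields[6]), int(fields[7])
        coverage = (abs(q_end - q_start) + 1) / transcript_lengths[query]
        if identity >= min_identity and coverage >= min_coverage:
            recovered.add(query)
    return len(recovered) / len(transcript_lengths)
```

Running this once per candidate assembly with the same annotation and thresholds gives comparable numbers, which is the point of using the annotation as the truth test.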

Good luck with your work!

Jen, Galaxy team

written 4.5 years ago by Jennifer Hillman Jackson ♦ 25k

Dear Jen,

Thank you for your kind and exhaustive response. I still have one question: I have the gene set annotations (gff or gtf) and the predicted transcript sequences (.fa). The Tophat2 application on Galaxy only allows me to use .fa files under the "Select reference genome" pop-up menu. Should I use the transcript sequences in this pop-up menu, or is there a way to use the gene set annotations as well?

Thanks

written 4.5 years ago by edoade_2014 • 30
