Inbuilt reference genome


19 months ago by

puri.sapna • 0 wrote:

Hi,

I ran a TopHat analysis on my samples, but after the Cuffdiff step none of the genes are annotated. I wanted to rerun the analysis using a downloaded version of the reference genome (mm10 in my case), but the only way I can run TopHat is with the built-in reference genome (which, in my case, gave me no gene names). Can I bypass the default Galaxy reference genome and use my own file? Please help.

Thanks, Sapna

rna-seq • 720 views


modified 19 months ago • written 19 months ago by puri.sapna • 0

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

The mm10 reference genome native to Galaxy is the correct one to use for mapping (TopHat or HISAT). Use that same reference genome with any downstream tool that requires a genome to be specified.

To obtain annotated genes, upload a reference annotation dataset and use it with Cuffmerge (along with the Cufflinks GTFs) to produce a complete GTF of all transcripts, novel and known. Then use the GTF produced by Cuffmerge as the reference annotation input dataset for Cuffdiff.

Alternatively, you can skip Cufflinks and Cuffmerge and instead use the reference annotation (GTF) directly from the source (iGenomes is best) with Cuffdiff.
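For readers running the same tools at the command line rather than through the Galaxy tool forms, the two routes above can be sketched roughly as follows. This is a minimal illustration, not the Galaxy implementation: the helper names, the output directories and every file path are hypothetical placeholders, and the commands are only built as argument lists, not executed.

```python
import shlex
from typing import Dict, List


def cuffmerge_cmd(assembly_list: str, ref_gtf: str, ref_fasta: str,
                  out_dir: str = "merged_asm") -> List[str]:
    """Build a Cuffmerge call that merges Cufflinks GTFs with a reference annotation.

    assembly_list is a text file listing one Cufflinks GTF path per line.
    """
    return ["cuffmerge", "-o", out_dir, "-g", ref_gtf, "-s", ref_fasta, assembly_list]


def cuffdiff_cmd(annotation_gtf: str, condition_bams: Dict[str, List[str]],
                 out_dir: str = "diff_out") -> List[str]:
    """Build a Cuffdiff call; replicates of one condition are comma-joined."""
    labels = ",".join(condition_bams)  # condition names, in insertion order
    bam_groups = [",".join(bams) for bams in condition_bams.values()]
    return ["cuffdiff", "-o", out_dir, "-L", labels, annotation_gtf, *bam_groups]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    conditions = {"control": ["c1.bam", "c2.bam"], "treated": ["t1.bam", "t2.bam"]}

    # Route 1: novel plus known transcripts (Cufflinks -> Cuffmerge -> Cuffdiff).
    print(shlex.join(cuffmerge_cmd("assemblies.txt", "genes.gtf", "mm10.fa")))
    print(shlex.join(cuffdiff_cmd("merged_asm/merged.gtf", conditions)))

    # Route 2: known transcripts only (reference GTF straight into Cuffdiff).
    print(shlex.join(cuffdiff_cmd("genes.gtf", conditions)))
```

The only difference between the two routes is which GTF is handed to Cuffdiff: the merged GTF from Cuffmerge, or the reference annotation itself.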

There are even more choices about when to use a reference annotation dataset: during the mapping step and/or during transcript assembly. Each of these workflow options produces slightly different results, and the right one depends on your goal: discovery of novel transcripts plus differential expression, or differential expression of known transcripts only.

Please see this prior Q&A for where to get the best version of a reference annotation dataset for mm10: https://biostar.usegalaxy.org/p/21827/#21845

For more on how the complete process is run, including a description of the alternatives, the manual and tutorials here have example usage:

Manual for this tool set: http://cole-trapnell-lab.github.io/cufflinks/manual/
RNA-seq tutorials: https://galaxyproject.org/learn/

We hope this helps! Jen, Galaxy team

written 19 months ago by Jennifer Hillman Jackson ♦ 25k

puri.sapna • 0 wrote:

Thanks Jennifer. I am obviously misremembering how I did this in the past. I thought mm10.gtf served as both a reference genome and a reference annotation file. When I last did this, Galaxy did not offer a built-in reference genome, so I had to upload mm10 and use it for TopHat and Cuffdiff, and it all worked.

Based on the Q&A you passed along, my conclusion is that the built-in Galaxy mm10 reference genome does not have gene names. Is that correct? So if I need gene names, I need to download the iGenomes version of mm10 and then upload it into my workflow on Galaxy. However, Galaxy still won't let me select another GTF file for TopHat, will it? It appears to need FASTA files. If you could clarify that, I would appreciate it.

Thanks, Sapna

written 19 months ago by puri.sapna • 0

Hi Sapna,

There are two distinct inputs:

Reference genome:

A FASTA file that is indexed for tools, either built-in or from a custom genome.
Most of the tools in this suite have an option where the genome is selected (mm10 is pre-indexed).
This is the mm10 index picked from the pull-down menu on the tool form (for nearly all tools) when the target genome option is labeled "genome", "database", "build", or similar. Sometimes the "database" for the genome used to generate a dataset must be assigned before downstream tools will recognize that dataset as an input (if it was not assigned automatically).
For TopHat, the form settings would be:
- Use a built in reference genome or own from your history > Use a built-in genome
- Select a reference genome > mm10

Reference annotation:

GTF/GFF3 data for this use case.
Most of the tools in this suite have an option where the annotation is selected.
This type of input is provided by the user for most tools, including this one.
If reference annotation data happens to be available for a tool (in some cases it requires a pre-computed index), that data will also be in a pull-down menu, but the option will not be labeled the same way as a genome on the tool form.
For TopHat, the form settings would be:
- TopHat settings to use > Full parameter list
- Do you want to supply your own junction data > Yes
- Use Gene Annotation Model > Yes
- Gene Model Annotations > select the GTF/GFF3 annotation dataset from your history
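
If it is unclear which of the two inputs a downloaded file is, a quick local check can help before uploading. The sketch below is only a rough heuristic under stated assumptions: it treats a file whose first data line starts with ">" as FASTA (reference genome) and one with nine tab-separated columns as GTF/GFF (reference annotation), and it counts how many records carry a GTF-style gene_name attribute. The function names and the sample records are made up for illustration, and the gene_name check does not apply to GFF3 attribute syntax.

```python
import re
from typing import Iterable, Optional

GENE_NAME = re.compile(r'gene_name "([^"]+)"')  # GTF-style attribute only


def first_data_line(lines: Iterable[str]) -> Optional[str]:
    """Return the first line that is neither blank nor a '#' comment."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            return line.rstrip("\r\n")
    return None


def guess_input_kind(lines: Iterable[str]) -> str:
    """Heuristically classify a file as 'fasta', 'gtf/gff' or 'unknown'."""
    line = first_data_line(lines)
    if line is None:
        return "unknown"
    if line.startswith(">"):
        return "fasta"  # sequence data: a reference genome candidate
    if len(line.split("\t")) == 9:
        return "gtf/gff"  # feature table: a reference annotation candidate
    return "unknown"


def gene_name_fraction(lines: Iterable[str]) -> float:
    """Fraction of 9-column records whose attributes include gene_name."""
    total = named = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 9:
            continue
        total += 1
        if GENE_NAME.search(fields[8]):
            named += 1
    return named / total if total else 0.0


if __name__ == "__main__":
    # Made-up records: one with a gene_name attribute, one without.
    sample = [
        'chr1\tsrc\texon\t100\t200\t.\t-\t.\tgene_id "GeneA"; gene_name "GeneA"; transcript_id "T1";\n',
        'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "T2"; transcript_id "T2";\n',
    ]
    print(guess_input_kind(sample), gene_name_fraction(sample))
```

On a real file, pass an open file handle instead of the list, for example `with open(path) as fh: print(guess_input_kind(fh))`. A low gene_name fraction would suggest the annotation is the reason gene names are missing from the Cuffdiff output.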

Thanks, Jen

modified 18 months ago • written 18 months ago by Jennifer Hillman Jackson ♦ 25k
