Question: TopHat and Microbial RNASeq
3.8 years ago by
United States
gkuffel22 wrote:

Hi everyone,


I am trying to use TopHat for RNASeq analysis of the bacterium Vibrio fischeri, and I've run into some problems. First, this genome is not built into Galaxy. I finally figured out how to build a custom genome, so I'm good there, but now I believe I need a GTF file for the gene annotation, and I have no idea where to find one or how to build one. Does anyone have any expertise in this?


rna-seq • 1.3k views
modified 3.8 years ago by Jennifer Hillman Jackson • written 3.8 years ago by gkuffel22
3.8 years ago by
United States
Jennifer Hillman Jackson wrote:


A reference annotation dataset is probably best obtained from a data provider, if one can be identified. You can also run the pipeline without annotation: the results would then reflect only the content of your NGS sequence inputs, and certain features of the tools (such as Cuffdiff) would not be used. More about the annotation features used by these tools can be found at the Cufflinks web site.

This genome is hosted in the UCSC Archaeal Genome Browser. The availability and type of annotation vary by strain. Also review the "Resources" tab; one of these research groups may have the annotation data you want, in GTF or GFF3 format. There are almost certainly other options. Reviewing publications is probably a good place to start, to gain insight into what others performing similar analyses are using.

If you do decide to use a reference annotation dataset, be sure to use the exact same reference genome that it is based on for your analysis. This may mean creating a new Custom Genome. The sequence identifiers, content, and lengths must match exactly between all inputs, meaning they were created from the same build and use the same nomenclature.
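One way to check this before running the pipeline is to compare the sequence identifiers and lengths in the FASTA file against the first and fifth columns of the annotation file. The following is a minimal sketch, not part of Galaxy or TopHat: the function names `fasta_lengths` and `check_annotation` are hypothetical, and the toy data (including the mismatched identifier `chr`) is made up for illustration.

```python
import io


def fasta_lengths(handle):
    """Return {sequence_id: length} for every record in a FASTA handle."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token after '>'.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def check_annotation(handle, lengths):
    """Return (unknown_ids, out_of_range_count) for a GTF/GFF3 handle."""
    unknown = set()
    out_of_range = 0
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a feature line
        seqid, end = fields[0], int(fields[4])
        if seqid not in lengths:
            unknown.add(seqid)      # identifier absent from the FASTA
        elif end > lengths[seqid]:
            out_of_range += 1       # feature extends past the sequence end
    return unknown, out_of_range


# Toy inputs (hypothetical): one matching and one mismatched identifier.
fasta = io.StringIO(">NC_006840.2 toy sequence\nACGTACGTAC\n")
gtf = io.StringIO(
    'NC_006840.2\ttoy\texon\t1\t8\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    'chr\ttoy\texon\t1\t8\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n'
)
lengths = fasta_lengths(fasta)
unknown, out_of_range = check_annotation(gtf, lengths)
print("Identifiers missing from the FASTA:", sorted(unknown))
print("Features past the sequence end:", out_of_range)
```

With real files, the two `io.StringIO` objects would be replaced by `open(...)` handles. Any identifier reported as missing suggests the annotation and the genome come from different builds or use different naming.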

Best, Jen, Galaxy team

written 3.8 years ago by Jennifer Hillman Jackson

First off, thank you so much for your help. You have been incredibly helpful. I did find the genome hosted in the UCSC Archaeal Genome Browser, and I was able to send the GTF output to Galaxy, which was great. But as you already mentioned, this caused issues: I created my reference genome from NCBI using accession #NC_006840.2, and this GTF file doesn't seem to match it.

I don't see a way to download a single FASTA file from UCSC; I only see a CDS FASTA from a multiple alignment. NCBI has a GFF file that matches the original FASTA file, but I don't think that will work either. Are you also saying that the analysis can be done without the GTF file in the first place? I thought the TopHat algorithm needed a GTF file.

written 3.8 years ago by gkuffel22

Very glad you were able to locate an annotation dataset; that is sometimes the difficult part. From here, obtaining the matching genome should be straightforward.

The build notes for the genome source should be listed in the UCSC browser (often on the first page, near the bottom). If they are not present, you could write to their support and ask which build and source were used, then obtain the exact same version. As far as I know, this version of the browser does not host the reference genomes through an FTP site, but double-check with them, as it was recently updated.

The annotation file is not required by any of the tools in this suite. Using it with TopHat is just one option, as it is with the downstream tools. The annotation would only provide additional splice junctions, which can be used as a guide (supplementing the splices found in your sequence input) or as truth (only splices in the annotation will be considered). The TopHat manual explains the difference in more detail so that you can fully understand how, when, and whether to use it for different analysis goals. The same is true for Cufflinks, etc.
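Outside Galaxy, these three modes correspond, per the TopHat manual, to running with no annotation, with `-G` alone (annotation as a guide), or with `-G` plus `--no-novel-juncs` (annotation as truth). The sketch below only builds the command lines so the difference is visible. The helper `tophat_command` and the index, reads, and GTF names are hypothetical placeholders, and nothing is executed.

```python
import shlex


def tophat_command(index, reads, gtf=None, annotation_only=False,
                   out_dir="tophat_out"):
    """Build a TopHat argument list for one of the three annotation modes."""
    cmd = ["tophat", "-o", out_dir]
    if gtf is not None:
        cmd += ["-G", gtf]                  # supply known junctions
        if annotation_only:
            cmd.append("--no-novel-juncs")  # consider annotated junctions only
    elif annotation_only:
        raise ValueError("annotation_only requires a GTF/GFF3 file")
    cmd += [index, reads]
    return cmd


# Placeholder names (assumptions, not from the thread).
index, reads, gtf = "vfischeri_index", "reads.fastq", "genes.gtf"

for label, cmd in [
    ("no annotation", tophat_command(index, reads)),
    ("annotation as guide", tophat_command(index, reads, gtf=gtf)),
    ("annotation as truth",
     tophat_command(index, reads, gtf=gtf, annotation_only=True)),
]:
    print(label + ":", " ".join(shlex.quote(arg) for arg in cmd))
```

In Galaxy the same choices are made through the tool form rather than flags, so treat this only as a map of the options the manual describes.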

Take care, Jen, Galaxy team

written 3.8 years ago by Jennifer Hillman Jackson