1.) I am having trouble adding annotations to my TopHat and Cufflinks
output. I used the Mus musculus mm9 reference via the *built-in index* for
TopHat mapping, but no annotations were available in the output files.
I then tried converting the reference genome from UCSC to a SAM file
using SAMtools. TopHat would not recognize this, but Cufflinks did.
The Cufflinks output file did not have the annotation either.
Any thoughts on the proper way to add annotations?
2.) I am also trying to filter the single mapped reads from the
mapped reads that resulted from TopHat. After converting the output
from TopHat, I used the filter tool in SAMtools, choosing *0x100
not primary*. Afterwards I tried to run Cufflinks on the filtered
output, only to have it fail.
My ultimate goal is to look at RNA-seq gene expression. I know that I
need to upload my files -> groom using FASTQ Groomer -> download a
reference sequence from UCSC -> convert the reference genome file to a
usable format -> run TopHat for mapping using the groomed file and the
converted annotation -> filter the single mapped reads -> run Cufflinks
using the filtered single mapped reads from TopHat.
Right now I need to get this basic pipeline to work.
Our RNA-seq tutorial and FAQ can help out with the general workflow:
An iGenomes reference annotation GTF dataset for mm9 is in the
Shared Libraries here:
http://usegalaxy.org -> Shared Data -> Data Libraries -> iGenomes ->
(Import "genes.gtf" to your history; please ignore other content, as
it is under revision.)
To address your questions, one key misunderstanding may be the
difference between a "reference genome" and a "reference annotation":
* "reference genome" = genomic sequence (sourced in .fasta format) that
the data is mapped against with TopHat and used as a scaffold for the
RNA-seq tools. Since you are using mm9, selecting the "built-in index"
for mm9 is an appropriate choice. A reference genome does not provide
annotation beyond genomic positional coordinates. When using a mapping
tool, including TopHat, there are mapping parameters that can be set to
specify whether to keep only the best or all hits. It sounds as if you
need to adjust these parameters in your run. The filter you ran
(question #2) may have removed most or all hits. Check the output of
the SAM filter: was the output greatly reduced or empty? If so, re-run
TopHat with parameters that keep the best hit from the start and move to
Cufflinks from there without filtering through SAMtools. Help is on the
tool form itself and in the links to the manual.
* "reference annotation" = known transcripts (sourced in .gtf or .gff3
format) that are also mapped against the reference genome. These
transcript annotations are the most useful when they contain gene,
transcript start site, and other key attributes that the Cuff* tools
interpret. This annotation can guide assembly at various levels (loose
or strict) depending on how the tool parameters are configured. The
annotation MUST be mapped to the exact same reference genome that your
FASTQ datasets are mapped to, with the exact same chromosome naming (see
the RNA-seq FAQ for details). Help is also on the Cuff* tool forms and
in the links to the manuals.
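
As a rough illustration of the chromosome naming point, here is a minimal Python sketch that compares the names in column 1 of a GTF file with the sequence names in a reference FASTA. The function names and the sample records are hypothetical, made up for this example; this is not part of Galaxy or the Cuff* tools.

```python
def gtf_seqnames(gtf_lines):
    """Collect the distinct values of column 1 (seqname) from GTF lines."""
    names = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        names.add(line.split("\t", 1)[0])
    return names


def fasta_seqnames(fasta_lines):
    """Collect sequence identifiers (text after '>' up to the first whitespace)."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names


def unmatched_names(gtf_lines, fasta_lines):
    """Return GTF chromosome names that do not occur in the reference FASTA."""
    return sorted(gtf_seqnames(gtf_lines) - fasta_seqnames(fasta_lines))


# Hypothetical sample data: one record uses "chr1", the other uses "1".
gtf = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    '1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g2"; transcript_id "t2";',
]
fasta = [">chr1", "ACGT", ">chr2", "ACGT"]

print(unmatched_names(gtf, fasta))  # prints ['1']
```

Any name reported here would point to a GTF record that does not line up with the reference by name, which is the mismatch described above.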
More help, including links to tool help, is on our wiki here:
(see "Tools on the Main server: Example: unexpected results with
RNA-seq analysis tools")
Hopefully this helps,
I'm glad this was helpful! Linking in the reference annotation can be
one of the biggest hurdles when beginning an RNA-analysis project.
Extracting GTF data directly from the UCSC Table Browser is currently
possible (and is what was used in the RNA-seq tutorial link I shared),
but data from this source does not contain all of the attributes in the
9th field that the Cuff* tools can utilize. This results in missed
functionality.
GTF data can also be obtained from Ensembl. These files have one extra
attribute that the UCSC GTF files do not have, but still do not have
all of the possible attributes, and so also result in missed
functionality when used.
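
To see which attributes a given GTF file actually carries in its 9th field, something like the following Python sketch could be used. The attribute names in WANTED (gene_id, transcript_id, gene_name, tss_id, p_id) are the ones I recall from the Cufflinks manual, so treat that list as an assumption and check it against the manual. The sample records are hypothetical, and the parser only handles quoted values.

```python
import re

# Assumed attribute names; verify against the Cufflinks manual before relying on them.
WANTED = ("gene_id", "transcript_id", "gene_name", "tss_id", "p_id")

# Matches key "value" pairs in the GTF 9th field (quoted values only).
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(field):
    """Turn a GTF attributes field into a dict of key -> value."""
    return dict(ATTR_RE.findall(field))


def records_lacking(gtf_lines, wanted=WANTED):
    """Count, for each wanted attribute, how many records do not carry it."""
    lacking = {name: 0 for name in wanted}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        present = parse_attributes(cols[8])
        for name in wanted:
            if name not in present:
                lacking[name] += 1
    return lacking


# Hypothetical records: the first has more attributes than the second.
sample = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\t'
    'gene_id "g1"; transcript_id "t1"; gene_name "Abc"; tss_id "TSS1";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t2";',
]

for name, count in records_lacking(sample).items():
    print(name, count)
```

A file where the optional attributes are missing from every record is the kind of input that leads to the missed functionality described above.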
The iGenomes dataset, linked from the Cufflinks website (below), contains
versions of GTF files from both of these sources that have been modified
to include the full complement of attributes. There was an updated
release in May and the Galaxy team has plans to include more iGenomes
reference annotation GTF files in the future in Shared Libraries.
The additional iGenomes attribute content itself is present at the
original sources, but contained in related tables or otherwise
formatted in ways that the Cuff* tools cannot use. Currently, the
iGenomes data covers only a small number of genomes. Really, anyone
with the bioinformatics skills to do the work could create a complete
Cuff*-compatible GTF file for any genome contained at UCSC, Ensembl, or
another reference genome source that has RefSeq or another stable
gene/transcript annotation track, using the ancillary tables and some
simple scripting to do the file manipulation. If they wanted, these
resulting GTF files could be loaded into a history and shared with the
Galaxy community using existing "Share or Publish" options. Any shared
GTF dataset that was well constructed (tested for accuracy) and
labeled with sources, etc., would, I'm sure, be a greatly appreciated
contribution.
About GTF versus BED: this is a good question. GTF format is what the
tool authors selected as the input, and it was a good choice over BED
format. I can explain why. It is very true that BED files are easier to
manipulate: all data are in distinct columns and most users are
familiar with tabular data and BED format in particular. The 9th field
of GTF files is difficult to work with, but this is also the part of the
file that is used for most of the conclusion layer functions the Cuff*
tools perform. GTF and BED files are similar in some ways, differ in
others (coordinate system), but most importantly BED does not contain
the attributes field - the key data that the Cuff* tools use to group
and annotate data beyond genomic coordinates. There just isn't a good
place to put this data in the BED data format specification. Besides,
... the tool authors get to decide this sort of thing :)
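
To make the coordinate difference concrete, here is a small Python sketch that converts one GTF record into a 6-column BED line. GTF coordinates are 1-based with an inclusive end, while BED uses a 0-based start and an exclusive end, so only the start shifts by one. The function name, the use of the feature type as the BED name, and the placeholder score of 0 are choices made for this illustration only.

```python
def gtf_line_to_bed6(line):
    """Convert one GTF record to a BED6 line (the attributes field is dropped)."""
    cols = line.rstrip("\n").split("\t")
    chrom, _source, feature, start, end, _score, strand = cols[:7]
    bed_start = int(start) - 1  # 1-based inclusive -> 0-based start
    bed_end = int(end)          # inclusive end equals the exclusive BED end
    # BED has no column for the 9th GTF field, so the gene and transcript
    # grouping information is lost here.
    return "\t".join([chrom, str(bed_start), str(bed_end), feature, "0", strand])


# Hypothetical record for illustration.
record = 'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";'
print(gtf_line_to_bed6(record))
```

The exon covers the same 101 bases in both formats, but the BED line no longer says which gene or transcript it belongs to.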
Thanks for a good discussion and the opportunity to share some more
about these tools & inputs!