How to map RNA-seq reads to an annotated reference genome in GFF format

4.4 years ago by

European Union

htreves • 0 wrote:

Hi,

I am trying to use TopHat2 in Galaxy to map RNA-seq reads against a GFF annotation file that I created. When I try to select a reference genome from my history, no dataset is offered as an option. I uploaded a GFF3 file. The same thing happened when I uploaded a GTF file from the "RNA-seq Analysis Exercise" page (named Galaxy Dataset | iGenomes UCSC hg19, chr19 gene annotation).

What am I doing wrong and how can I get tophat to work with my data?

Thanks,

Haim Treves

Dept. Plant and Environmental Sciences

The Alexander Silberman Institute of Life Sciences

The Hebrew University of Jerusalem

91904 Jerusalem, Israel

Phone(Lab): 972 2 6585204/31

Fax (Lab): 972 2 6584463

gff tophat galaxy rna-seq • 11k views

4.4 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

The RNA-seq reads must be aligned against a reference genome or transcriptome for use with Tophat. A GTF file, or the top portion (or all) of a GFF3 file, is a reference annotation dataset: it describes features on a reference genome/transcriptome rather than the sequence itself. Protocol help is here:
https://wiki.galaxyproject.org/Support#Tools_on_the_Main_server:_RNA-seq

The iGenomes GTF collection includes annotation for the UCSC-published versions of genomes. These genomes are in Galaxy as built-in indexes; search by the short genome name (or "dbkey") to locate one, for example "hg19", "mm10", or "dm3".

If supplying a custom reference genome, load and use a fasta dataset. Instructions are in the link below. Please be aware that when using GFF3 datasets, the tool expects only the top annotation portion of the file to be used as the "reference annotation". If there is fasta sequence at the end of the file, put it in a separate dataset and use that as the "reference genome".
https://wiki.galaxyproject.org/Learn/Datatypes#GFF3
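If you would rather do the split before uploading, here is a minimal Python sketch. It assumes the sequence section of the file starts with the `##FASTA` directive, as described in the GFF3 specification; the function name and the example lines are made up for illustration and are not part of Galaxy or Tophat.

```python
from typing import List, Tuple


def split_gff3(lines: List[str]) -> Tuple[List[str], List[str]]:
    """Split GFF3 lines into (annotation_lines, fasta_lines).

    Assumption: the embedded sequence starts after a '##FASTA' directive.
    The directive line itself is dropped from both parts.
    """
    annotation: List[str] = []
    fasta: List[str] = []
    in_fasta = False
    for line in lines:
        if not in_fasta and line.strip() == "##FASTA":
            in_fasta = True
            continue
        (fasta if in_fasta else annotation).append(line)
    return annotation, fasta


# Tiny in-memory example (hypothetical content).
example = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t1\t9\t.\t+\t.\tID=gene1\n",
    "##FASTA\n",
    ">chr1\n",
    "ACGTACGTA\n",
]
annotation_part, fasta_part = split_gff3(example)
```

For a real file, you could read the lines with `open(path).readlines()`, write `annotation_part` to one file and `fasta_part` to another, and then upload them to Galaxy as two datasets.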

Assigning the correct datatype to each dataset is important; use the pencil icon to do this.
https://wiki.galaxyproject.org/Support#Tool_doesn.27t_recognize_dataset
More about Custom genomes:
https://wiki.galaxyproject.org/Support#Custom_reference_genome
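
Before assigning a datatype, it can help to confirm what a file actually contains. The sketch below is a rough local check, not Galaxy's own datatype detection: it looks at the first meaningful line and guesses fasta, GFF3, GTF, or generic GFF. The function name and the heuristics (a `>` header for fasta, a `##gff-version` directive, nine tab-separated columns, a quoted `gene_id` attribute for GTF) are assumptions chosen for illustration.

```python
from typing import Iterable


def guess_format(first_lines: Iterable[str]) -> str:
    """Guess a file format from its first lines (heuristic only)."""
    for line in first_lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if text.startswith(">"):
            return "fasta"
        if text.startswith("##gff-version"):
            version = text.split()[1:2]
            if version and version[0].startswith("3"):
                return "gff3"
            return "gff"
        if text.startswith("#"):
            continue  # other comment or directive lines
        fields = text.split("\t")
        if len(fields) == 9:
            # GTF attributes look like: gene_id "X"; transcript_id "Y";
            return "gtf" if 'gene_id "' in fields[8] else "gff"
        return "unknown"
    return "unknown"


fasta_guess = guess_format([">chr19\n", "ACGTACGT\n"])
gff3_guess = guess_format(["##gff-version 3\n"])
gtf_guess = guess_format(
    ['chr19\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "A"; transcript_id "A.1";\n']
)
```

A file that comes back as "gff3" or "gtf" belongs in the "reference annotation" slot, and only a fasta file can serve as the "reference genome".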

Take care, Jen, Galaxy team

4.4 years ago by

htreves • 0

European Union

htreves • 0 wrote:

Thanks,

That was very helpful, and I now have all the output files from Tophat. However, even after creating a new track browser from the visualization menu and selecting my genome as the reference genome, the output datasets are not recognized when I try to add them: the browser reports that there are no items in my unnamed history, although I can see all of these files in my history pane.

How can I get the visualization tool to recognize these files?

Thanks again,

Haim

4.4 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Did you add your Custom Reference Genome as a Custom Build? This is necessary for Trackster, and builds can also be found or added under "User -> Custom Builds" by anyone who wishes to include one.

If yes, then go through your datasets and assign the "database" of each one to be your Custom Build. This tells Galaxy that they are mapped to the same genomic backbone that the Trackster visualization is based on. All datasets in accepted formats (most standard formats are) should then show up in the add datasets window to choose from.

Best,

Jen

Galaxy team

4.4 years ago by

htreves • 0

European Union

htreves • 0 wrote:

Hey again,

Now it seems to work! Thank you! :)

Going over the data, I can see the reads mapped to a location in the reference sequence (fasta format), but I cannot tell how that location corresponds to the annotated genes, since I could only use a fasta-format file as the reference (and not an annotation file, like the GFF3 that I have).

We tried to add the gff dataset to the visualization for that purpose, but we get the following error:

Input error: Chromosome 140113503864811 found in your input file but not in your genome file.
needLargeMem: trying to allocate 0 bytes (limit: 100000000000)
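
The first line of the error reads as if a sequence name in the annotation is missing from the genome. The thread does not establish whether that is the real cause here, but a quick way to test the idea is to compare the names in column 1 of the GFF with the names in the fasta headers. The Python sketch below does that on in-memory lines. The function names and example data are hypothetical, and it assumes the sequence name is the first whitespace-delimited word after `>` in each fasta header.

```python
from typing import Iterable, List, Set


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Collect sequence names from fasta header lines."""
    names: Set[str] = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])  # first word after '>'
    return names


def gff_seqids(lines: Iterable[str]) -> Set[str]:
    """Collect the values of column 1 (seqid) from GFF feature lines."""
    seqids: Set[str] = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # blank, comment, or directive line
        seqids.add(line.split("\t", 1)[0].strip())
    return seqids


def missing_from_genome(gff_lines: Iterable[str],
                        fasta_lines: Iterable[str]) -> List[str]:
    """Return GFF sequence names that have no matching fasta header."""
    return sorted(gff_seqids(gff_lines) - fasta_names(fasta_lines))


# Hypothetical example data.
gff_example = [
    "##gff-version 3\n",
    "chr1\t.\tgene\t1\t9\t.\t+\t.\tID=g1\n",
    "scaffold_7\t.\tgene\t1\t9\t.\t+\t.\tID=g2\n",
]
fasta_example = [">chr1 some description\n", "ACGTACGTA\n"]
missing = missing_from_genome(gff_example, fasta_example)
```

Any name in `missing` appears in the annotation but not in the genome, so either the annotation or the fasta headers would need renaming until the two sets agree.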
