Converting Ensembl GRCm38 gff to make it into UCSC format

I am using STAR to align my RNA-seq reads to the UCSC mm10 mouse genome on Galaxy. Galaxy has the UCSC mm10 FASTA mouse sequence built in, so I'm using this rather than building my own from Ensembl GRCm38, which would use too much memory.

As a result, I have to use the UCSC mm10 GTF annotation file to be compatible. Now comes the problem: at the next step (gene counting using HTSeq) I end up with tons of ambiguous mappings, because in UCSC GTFs the gene_id attribute incorrectly contains the same value as the transcript_id attribute, and hence a different value for each transcript of the same gene. So if a read maps to an exon shared by several transcripts of the same gene, htseq-count sees it as an overlap with several genes. Therefore, these GTF files cannot be used as is.
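To check whether a given GTF has this problem before running htseq-count, something like the following minimal Python sketch could be used. It assumes the usual 9-column, tab-separated GTF layout with `key "value";` attributes; the function names are made up for illustration and are not part of HTSeq or Galaxy.

```python
import re

# Matches GTF attribute pairs such as: gene_id "Xkr4";
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Return a dict of the key "value" pairs in a GTF attribute column."""
    return dict(_ATTR.findall(field))


def count_shared_ids(gtf_lines):
    """Count exon records whose gene_id equals their transcript_id.

    Returns (same, total): the number of such exon records and the total
    number of exon records seen. (Hypothetical helper, for illustration.)
    """
    same = total = 0
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue  # only look at well-formed exon records
        attrs = parse_attributes(cols[8])
        total += 1
        gene_id = attrs.get("gene_id")
        if gene_id is not None and gene_id == attrs.get("transcript_id"):
            same += 1
    return same, total


# Example use (the file name is a placeholder):
# with open("mm10.gtf") as handle:
#     same, total = count_shared_ids(handle)
```

If `same` equals `total`, every exon record uses the transcript ID as its gene ID, which is the situation described above.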

The simplest solution I can think of is to convert the first column in the Ensembl GRCm38 GFF into UCSC format (basically, convert 1 to chr1 etc., but with several important exceptions). I have found the mappings on GitHub: https://github.com/dpryan79/ChromosomeMappings/blob/master/GRCm38_ensembl2UCSC.txt
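Outside Galaxy, the conversion I have in mind could be sketched in Python roughly as below. This is only a sketch under assumptions: it assumes the mapping file is two tab-separated columns (Ensembl name, then UCSC name), it skips mapping entries with no UCSC name, and it drops annotation records whose sequence name has no mapping. The function names and file names are hypothetical.

```python
def load_mapping(lines):
    """Build {ensembl_name: ucsc_name} from two-column, tab-separated lines.

    Entries with an empty UCSC name are skipped (assumed to mean that the
    sequence has no UCSC equivalent).
    """
    mapping = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[0] and parts[1]:
            mapping[parts[0]] = parts[1]
    return mapping


def rename_chromosomes(gff_lines, mapping):
    """Yield GFF/GTF lines with the first column renamed via `mapping`.

    Comment lines (starting with '#') pass through unchanged. Records whose
    sequence name is not in the mapping are dropped, not guessed at.
    """
    for line in gff_lines:
        if line.startswith("#"):
            yield line
            continue
        seqid, sep, rest = line.partition("\t")
        new_name = mapping.get(seqid)
        if sep and new_name:
            yield new_name + sep + rest


# Example use (file names are placeholders):
# with open("GRCm38_ensembl2UCSC.txt") as m:
#     mapping = load_mapping(m)
# with open("ensembl.gff") as src, open("ucsc_style.gff", "w") as dst:
#     dst.writelines(rename_chromosomes(src, mapping))
```

Only the first column is touched, so coordinates and attributes stay exactly as Ensembl wrote them.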

Is this possible in Galaxy? Or does anyone have this done already? Surely this is a common problem encountered by people using STAR and HTSeq on Galaxy? How does everyone else overcome it without starting from scratch with Ensembl genome assembly?

Many thanks

igenomes refseq annotation mm10 gtf • 711 views

modified 10 months ago by Jennifer Hillman Jackson ♦ 25k • written 10 months ago by gabrielle.wheway • 0

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Converting is possible but using a GTF that is a match for mm10 with distinct gene and transcript identifiers would be simpler. The choice is yours.

Options:

This prior post explains how to obtain and upload the mm10 iGenomes GTF. Gene and transcript IDs are distinct, and it contains other extra attributes used by the Tuxedo tools (Cufflinks, Cuffdiff, etc.), specifically p_id, tss_id, and gene_name: https://biostar.usegalaxy.org/p/21827/
An alternative is the mm10 RefSeq GTF available in a public Data Library associated with one of our tutorials. You can download it for use at another server or import it directly from the library into a history if working at Galaxy Main https://usegalaxy.org. This differs from the same GTF exported from the UCSC Table Browser in that it has the gene name populated properly. The gene name comes from the column "name2" in the RefSeq track's primary table found in the Table Browser; this gene_id information is lost when the track is exported as a GTF dataset.
More choices, including resolving identifier mismatches, are covered in this FAQ: https://galaxyproject.org/support/ >> https://galaxyproject.org/support/chrom-identifiers/

Galaxy tutorials for reference: https://galaxyproject.org/learn/

Thanks! Jen, Galaxy team

modified 10 months ago • written 10 months ago by Jennifer Hillman Jackson ♦ 25k