Question: Converting Ensembl GRCm38 gff to make it into UCSC format
10 months ago, gabrielle.wheway wrote:

I am using STAR to align my RNA-seq reads to the UCSC mm10 mouse genome on Galaxy. Galaxy has the UCSC mm10 FASTA mouse sequence built in, so I'm using this rather than constructing my own from Ensembl GRCm38, as that would use too much memory.

As a result, I have to use the UCSC mm10 GTF annotation file to be compatible. Now comes the problem: at the next step (gene counting with HTSeq) I get tons of ambiguous mappings, because in UCSC GTFs the gene_id attribute incorrectly contains the same value as the transcript_id attribute, and hence a different value for each transcript of the same gene. So if a read maps to an exon shared by several transcripts of the same gene, htseq-count sees an overlap with several genes. These GTF files therefore cannot be used as is.
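A quick way to confirm this symptom in a given GTF is to count how many feature lines carry identical gene_id and transcript_id values. The following is only a minimal sketch: the function names `parse_attributes` and `count_identical_ids` are made up for illustration, the sample lines are invented, and it assumes the standard tab-separated GTF layout with `key "value";` attributes in the ninth column.

```python
import re

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(attr_field):
    """Return the key "value" pairs of a GTF attribute column as a dict."""
    return dict(ATTR_RE.findall(attr_field))


def count_identical_ids(gtf_lines):
    """Return (lines with both IDs, lines where gene_id == transcript_id)."""
    total = 0
    identical = 0
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(fields[8])
        if "gene_id" not in attrs or "transcript_id" not in attrs:
            continue
        total += 1
        if attrs["gene_id"] == attrs["transcript_id"]:
            identical += 1
    return total, identical


if __name__ == "__main__":
    # Invented example records, not taken from a real annotation file.
    sample = [
        'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "NM_1"; transcript_id "NM_1";\n',
        'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "GeneA"; transcript_id "NM_2";\n',
    ]
    print(count_identical_ids(sample))
```

If nearly every line is counted as identical, the file most likely has the problem described above.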

The simplest solution I can think of is to convert the first column of the Ensembl GRCm38 GFF into UCSC format (basically, convert 1 to chr1 etc., but with several important exceptions). I have found the mappings on GitHub: https://github.com/dpryan79/ChromosomeMappings/blob/master/GRCm38_ensembl2UCSC.txt

Is this possible in Galaxy? Or has anyone done this already? Surely this is a common problem for people using STAR and HTSeq on Galaxy? How does everyone else overcome it without starting from scratch with the Ensembl genome assembly?

Many thanks

10 months ago, Jennifer Hillman Jackson (United States) wrote:

Hello,

Converting is possible, but using a GTF that matches mm10 and has distinct gene and transcript identifiers would be simpler. The choice is yours.

Options:

  1. This prior post explains how to obtain and upload the mm10 iGenomes GTF. Gene and transcript IDs are distinct, and it contains the extra attributes used by the Tuxedo tools (Cufflinks, Cuffdiff, etc.), specifically p_id, tss_id, and gene_name: https://biostar.usegalaxy.org/p/21827/

  2. An alternative is the mm10 RefSeq GTF available in a public Data Library associated with one of our tutorials that can be found here. You can download it for use on another server, or import it directly from the library into a history if working at Galaxy Main https://usegalaxy.org. This differs from the same GTF exported from the UCSC Table Browser in that it has the gene name populated properly. The gene name comes from the column "name2" in the RefSeq track's primary table found in the Table Browser; this gene_id information is lost when the track is exported as a GTF dataset.

  3. More choices, including resolving identifier mismatches, are covered in this FAQ: https://galaxyproject.org/support/ >> https://galaxyproject.org/support/chrom-identifiers/
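For reference, the first-column conversion asked about in the question could be sketched outside Galaxy roughly as follows. This is an illustration under stated assumptions, not a tested pipeline: it assumes the mapping file is tab-separated with the Ensembl name in the first column and the UCSC name in the second, it drops annotation lines whose sequence name has no UCSC equivalent, and the helper names `load_mapping` and `rename_chromosomes` as well as the sample data are hypothetical.

```python
import io


def load_mapping(handle):
    """Read 'ensembl<TAB>ucsc' pairs; skip rows with a missing name."""
    mapping = {}
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[0] and parts[1]:
            mapping[parts[0]] = parts[1]
    return mapping


def rename_chromosomes(lines, mapping):
    """Yield GTF/GFF lines with column 1 renamed via the mapping.

    Comment lines pass through unchanged. Lines whose sequence name is
    not in the mapping are dropped (an assumption of this sketch).
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        seqname, sep, rest = line.partition("\t")
        if seqname in mapping:
            yield mapping[seqname] + sep + rest


if __name__ == "__main__":
    # Invented stand-ins for the mapping file and an Ensembl annotation.
    mapping_text = "1\tchr1\nMT\tchrM\nSCAFFOLD_X\t\n"
    gtf_text = (
        "#!genome-build GRCm38\n"
        '1\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
        'MT\tensembl\texon\t5\t50\t.\t+\t.\tgene_id "G2"; transcript_id "T2";\n'
        'SCAFFOLD_X\tensembl\texon\t1\t9\t.\t+\t.\tgene_id "G3"; transcript_id "T3";\n'
    )
    mapping = load_mapping(io.StringIO(mapping_text))
    for out_line in rename_chromosomes(io.StringIO(gtf_text), mapping):
        print(out_line, end="")
```

With real data, the two `io.StringIO` objects would be replaced by open file handles for the mapping file and the annotation, and the output written to a new file for upload.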

Galaxy tutorials for reference: https://galaxyproject.org/learn/

Thanks! Jen, Galaxy team
