I am using STAR to align my RNA-seq reads to the UCSC mm10 mouse genome on Galaxy. Galaxy has the UCSC mm10 FASTA mouse sequence built in, so I'm using this rather than constructing my own from Ensembl GRCm38, as that would use too much memory.
As a result, I have to use the UCSC mm10 GTF annotation file to be compatible. Now comes the problem: at the next step (gene counting with HTSeq) I end up getting tons of ambiguous mappings, because in UCSC GTFs the gene_id attribute incorrectly contains the same value as the transcript_id attribute, and hence a different value for each transcript of the same gene. So if a read maps to an exon shared by several transcripts of the same gene, htseq-count sees this as an overlap with several genes. Therefore, these GTF files cannot be used as is.
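To show what I mean, here is a rough Python sketch that counts how many feature lines have gene_id equal to transcript_id. It assumes the standard 9-column, tab-separated GTF layout with attributes written as `key "value";`, and the function name `count_identical_ids` is just something I made up for illustration.

```python
import re

# Attribute patterns for the 9th GTF column, e.g. gene_id "NM_001"; transcript_id "NM_001";
GENE_RE = re.compile(r'gene_id "([^"]*)"')
TX_RE = re.compile(r'transcript_id "([^"]*)"')


def count_identical_ids(lines):
    """Return (same, total): lines where gene_id == transcript_id, and lines carrying both."""
    same = 0
    total = 0
    for line in lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        gene = GENE_RE.search(fields[8])
        tx = TX_RE.search(fields[8])
        if gene and tx:
            total += 1
            if gene.group(1) == tx.group(1):
                same += 1
    return same, total
```

If nearly every line is counted as identical, the file has the problem described above and htseq-count will treat each transcript as its own gene.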
The simplest solution I can think of is to convert the first column of the Ensembl GRCm38 GFF into UCSC format (basically, convert 1 to chr1 etc., but with several important exceptions). I have found the mappings on GitHub: https://github.com/dpryan79/ChromosomeMappings/blob/master/GRCm38_ensembl2UCSC.txt
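Outside Galaxy, the conversion I have in mind would look roughly like the Python sketch below. It assumes the mapping file is a two-column, tab-separated table (Ensembl name, then UCSC name) in which a row may have an empty second column when there is no UCSC equivalent. The function names `load_mapping` and `rename_chromosomes` are my own, and dropping features on unmapped sequences is a choice I made for the sketch, not something the mapping file dictates.

```python
import io


def load_mapping(handle):
    """Read a two-column, tab-separated Ensembl -> UCSC sequence-name table."""
    mapping = {}
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        # Skip rows that have no UCSC equivalent (empty or missing second column).
        if len(parts) >= 2 and parts[0] and parts[1]:
            mapping[parts[0]] = parts[1]
    return mapping


def rename_chromosomes(gtf_in, gtf_out, mapping):
    """Rewrite column 1 of a GTF/GFF; return the set of sequence names that were dropped."""
    dropped = set()
    for line in gtf_in:
        if line.startswith("#") or not line.strip():
            gtf_out.write(line)  # keep header, comment and blank lines untouched
            continue
        chrom, sep, rest = line.partition("\t")
        if chrom in mapping:
            gtf_out.write(mapping[chrom] + sep + rest)
        else:
            dropped.add(chrom)  # feature on a sequence with no UCSC name
    return dropped


if __name__ == "__main__":
    # Tiny in-memory demo with made-up rows; real use would open the actual files.
    demo_map = load_mapping(io.StringIO("1\tchr1\nMT\tchrM\n"))
    out = io.StringIO()
    rename_chromosomes(io.StringIO("1\tsrc\texon\nMT\tsrc\tgene\n"), out, demo_map)
    print(out.getvalue(), end="")
```

Returning the dropped names makes it easy to check that only unplaced scaffolds or patches were lost, rather than a whole chromosome.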
Is this possible in Galaxy? Or has anyone already done this? Surely this is a common problem for people using STAR and HTSeq on Galaxy? How does everyone else get around it without starting from scratch with the Ensembl genome assembly?