cummeRbund: gene_short_name and gene_id match each other but not true coordinates in gtf file

24 months ago by

nblouin69 • 0 wrote:

Hi there-

I have been working with a human sample (hg38) for a colleague, and I have run into something in the cummeRbund output files that I haven't seen before and am having trouble figuring out. In several places in my output files, several genes that are near each other on a chromosome get the exact same coordinates in the output tables (e.g., gene.features<-annotation(genes(cuff)) and sig_gene_data = subset(gene_diff_data,(significant=='yes' | q_value < 0.01))). I imagine this is a parsing issue, but I am left scratching my head. I have checked the fpkm files and re-mapped the data to just the correct gene in the case of CENPT (from the example below, as its reported location is definitely wrong), and the mapping numbers are right, but the location is wrong. I ran this through HiSat2 -> Stringtie (-dta-cufflinks) and also through Tophat2 -> Cufflinks2 to get to this point, with similar results. As far as I can see, the *.gtf file is OK.

In this truncated example, NUTF2, CENPT, and THAP11 are all in the same region of chr16 but in no way do they overlap each other, yet in my files they look like this:

gene_id      class_code  nearest_ref_id  gene_short_name  locus                    length  coverage  seqnames  start     end       width  strand
XLOC_010573  NA          NA              THAP11           chr16:67828156-67872567  NA      NA        chr16     67842310  67844195  1886   +
XLOC_010574  NA          NA              NUTF2            chr16:67828156-67872567  NA      NA        chr16     67846916  67846985  70     +
XLOC_011145  NA          NA              CENPT            chr16:67828156-67872567  NA      NA        chr16     67830390  67830548  159    -

Any thoughts would be welcome. Thanks

cummerbund rna-seq • 719 views

modified 23 months ago • written 24 months ago by nblouin69 • 0

23 months ago by

Jennifer Hillman Jackson ♦ 25k wrote:

Was this produced using the most current version of the cummeRbund tools as wrapped for Galaxy?

Are there additional XLOCs that report the same locus? That can trigger over-clustering if there is overlap. If any are associated with rRNA, tRNA, etc., removing these should help.
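One way to check this outside of Galaxy is a short script along these lines. It is only a sketch: the rows are copied from the table in the question, and the helper names (`shared_loci`, `any_overlap`) are made up for illustration rather than taken from cummeRbund. It lists every locus string reported by more than one XLOC and tests whether the per-gene start/end intervals actually overlap.

```python
from collections import defaultdict

# Rows copied from the question: (gene_id, gene_short_name, locus, start, end)
rows = [
    ("XLOC_010573", "THAP11", "chr16:67828156-67872567", 67842310, 67844195),
    ("XLOC_010574", "NUTF2", "chr16:67828156-67872567", 67846916, 67846985),
    ("XLOC_011145", "CENPT", "chr16:67828156-67872567", 67830390, 67830548),
]


def shared_loci(rows):
    """Return {locus: [rows]} for every locus reported by more than one gene_id."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[2]].append(row)
    return {locus: members for locus, members in groups.items() if len(members) > 1}


def any_overlap(members):
    """True if any two (start, end) intervals in the group overlap."""
    intervals = sorted((start, end) for _, _, _, start, end in members)
    # After sorting by start, an overlap exists only if some interval
    # begins at or before the end of the interval preceding it.
    return any(cur[0] <= prev[1] for prev, cur in zip(intervals, intervals[1:]))


shared = shared_loci(rows)
for locus, members in shared.items():
    names = [name for _, name, _, _, _ in members]
    print(locus, names, "overlapping:", any_overlap(members))
```

For a full table, the `rows` list would be filled from the exported gene annotation file instead of being typed in by hand.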

CENPT might be in the wrong location due to a mismatch between the genome build the GTF is based on and the genome build used for mapping (UCSC's hg38). Perhaps paste in a few lines in a reply comment if you are not sure. Please also include the exact source.
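If it helps to compare the GTF against the coordinates in the output table, something like the following could pull the span of each gene straight from the annotation. This is a rough sketch that assumes a standard tab-separated GTF with a `gene_name` attribute; the function name `gene_spans` and the three example lines (`chrTest`, `GENE_A`, `GENE_B`) are placeholders, not real annotation data.

```python
import re

# Matches attribute pairs such as: gene_name "CENPT";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gene_spans(gtf_lines, attr="gene_name"):
    """Return {name: (chrom, min_start, max_end, strand)} over all features.

    Assumes each gene name appears on a single chromosome; the chromosome
    and strand are taken from the first feature seen for that name.
    """
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        name = dict(ATTR_RE.findall(fields[8])).get(attr)
        if name is None:
            continue
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        if name in spans:
            c, s, e, st = spans[name]
            spans[name] = (c, min(s, start), max(e, end), st)
        else:
            spans[name] = (chrom, start, end, strand)
    return spans


# Placeholder records for illustration only (not real coordinates).
example = [
    'chrTest\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A";',
    'chrTest\tdemo\texon\t300\t450\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A";',
    'chrTest\tdemo\texon\t900\t950\t.\t-\t.\tgene_id "G2"; gene_name "GENE_B";',
]
spans = gene_spans(example)
for name, span in spans.items():
    print(name, span)
```

With a real file, `gtf_lines` would be the open file handle for the GTF, and the spans for CENPT, NUTF2, and THAP11 could then be compared with the `seqnames`, `start`, and `end` columns in the table and with the same genes in a genome browser set to hg38.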

There could be other factors, but let's start there.
