Hi there,
I have been working with a human sample (hg38) for a colleague, and I have run into something in the cummeRbund output files that I have not seen before and cannot figure out. In several places, genes that are near each other on a chromosome are given exactly the same locus coordinates in the output tables (e.g., gene.features<-annotation(genes(cuff)) and sig_gene_data = subset(gene_diff_data,(significant=='yes' | q_value < 0.01))). I imagine this is a parsing issue, but I am left scratching my head.
I have checked the FPKM files and, for CENPT (shown below, and definitely reported in the wrong place), re-mapped the data to just that gene: the mapping numbers are right, but the location is wrong. I ran this through HISAT2 (--dta-cufflinks) -> StringTie and also through TopHat2 -> Cufflinks2 to get to this point, with similar results. As far as I can see, the *.gtf file is OK.
In this truncated example, NUTF2, CENPT, and THAP11 are all in the same region of chr16 but in no way do they overlap each other, yet in my files they look like this:
gene_id      class_code  nearest_ref_id  gene_short_name  locus                    length  coverage  seqnames  start     end       width  strand
XLOC_010573  NA          NA              THAP11           chr16:67828156-67872567  NA      NA        chr16     67842310  67844195  1886   +
XLOC_010574  NA          NA              NUTF2            chr16:67828156-67872567  NA      NA        chr16     67846916  67846985  70     +
XLOC_011145  NA          NA              CENPT            chr16:67828156-67872567  NA      NA        chr16     67830390  67830548  159    -
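To make the mismatch concrete, here is a minimal sketch in plain Python (independent of cummeRbund) that takes the three rows above, groups them by the locus string, and checks two things: whether the per-gene start/end intervals overlap each other, and whether each interval falls inside the shared locus span. The helper names (group_by_locus, overlapping_pairs, parse_locus, inside_locus) are my own and purely illustrative; the only data used are the values from the table.

```python
from collections import defaultdict
from itertools import combinations

# The three rows from the table above: (gene_short_name, locus, start, end).
ROWS = [
    ("THAP11", "chr16:67828156-67872567", 67842310, 67844195),
    ("NUTF2", "chr16:67828156-67872567", 67846916, 67846985),
    ("CENPT", "chr16:67828156-67872567", 67830390, 67830548),
]


def parse_locus(locus):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    chrom, span = locus.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def group_by_locus(rows):
    """Map each locus string to the list of (name, start, end) sharing it."""
    groups = defaultdict(list)
    for name, locus, start, end in rows:
        groups[locus].append((name, start, end))
    return dict(groups)


def overlapping_pairs(genes):
    """Return pairs of gene names whose closed intervals share a position."""
    pairs = []
    for (a, a_start, a_end), (b, b_start, b_end) in combinations(genes, 2):
        if a_start <= b_end and b_start <= a_end:
            pairs.append((a, b))
    return pairs


def inside_locus(locus, genes):
    """Return the names of genes whose interval lies within the locus span."""
    _, locus_start, locus_end = parse_locus(locus)
    return [name for name, start, end in genes
            if locus_start <= start and end <= locus_end]


if __name__ == "__main__":
    for locus, genes in group_by_locus(ROWS).items():
        print(locus, "is shared by", [name for name, _, _ in genes])
        print("  overlapping start/end pairs:", overlapping_pairs(genes))
        print("  intervals inside the locus span:", inside_locus(locus, genes))
```

On these three rows it reports a single shared locus string, no overlapping start/end pairs, and all three start/end intervals falling inside the wider locus span. So the start/end columns differ per gene, while the locus column is identical for all of them.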
Any thoughts would be welcome. Thanks