Question: Multiple UCSC gene names in RStudio Data Table
nashedm wrote (2.6 years ago, United States):

Hi there,

I am using the cummeRbund package in RStudio on my personal computer to analyze differentially expressed genes from Cuffdiff output files generated in Galaxy.

When extracting the list of genes that are differentially expressed, I initially get only the XLOC names. To get the UCSC names, I do the following:

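# XLOC IDs of genes that differ significantly between Control and CORT
SigGenesData_ContCORT <- getSig(cuff_data, x = "Control", y = "CORT", alpha = 0.05, level = "genes")

# Gene-level data for those IDs
DiffGenes_ContCORT <- getGenes(cuff_data, SigGenesData_ContCORT)

# Table of tracking_id (XLOC) and gene_short_name (UCSC names)
GeneIDs_ContCORT <- featureNames(DiffGenes_ContCORT)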

But I still want the actual gene symbols, not just the UCSC names, so my workaround is to download a file from the UCSC site that maps known UCSC IDs to gene symbols. I then import this list into RStudio and merge it with my GeneIDs_ContCORT data table on the UCSC ID column, which is common to both tables.

This works fine with the exception of one problem. Some genes have multiple UCSC IDs, so in my GeneIDs_ContCORT table a good number of genes have several names separated by commas. For example:

        tracking_id              gene_short_name

1  XLOC_000525           uc007csi.1,uc007csj.1,uc007csk.1

So when I merge the tables, R doesn't match this to a UCSC ID from the downloaded list and just gives me "NA", because it reads those 3 names as a single name and can't find a match.

Is there a way I can instruct R to remove all but the first value in the gene_short_name column so that it can be properly matched? I.e. desired output:

        tracking_id              gene_short_name

1  XLOC_000525           uc007csi.1

Alternatively, is there a way to merge the tables such that the multiple UCSC names in one cell are all read separately and matched in my merged table?

Any help would be appreciated.


Jennifer Hillman Jackson wrote (2.6 years ago, United States):


It might be easiest to back up and use a reference GTF/GFF3 file that contains the attribute "gene_name" with Cuffmerge and then Cuffdiff. One source is iGenomes. This will avoid downstream complications.

Data files can be manipulated in many ways on the command line, or within Galaxy, or with RStudio, but the above is the most direct approach.

Best, Jen, Galaxy team
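
If re-running Cuffmerge and Cuffdiff is not practical, the two RStudio options raised in the question could be sketched roughly as below. This is a minimal base-R sketch, not a tested recipe for real cummeRbund output: the two data frames are toy stand-ins, and the mapping-table column names (ucsc_id, symbol), the second XLOC row, and the gene symbols are placeholders invented for illustration. The real UCSC download will have its own column names.

```r
# Toy stand-ins for the two tables described in the question.
# Column names of the mapping table, the second row, and the symbols
# are hypothetical placeholders, not real annotations.
gene_ids <- data.frame(
  tracking_id     = c("XLOC_000525", "XLOC_000526"),
  gene_short_name = c("uc007csi.1,uc007csj.1,uc007csk.1", "uc007aaa.1"),
  stringsAsFactors = FALSE
)
ucsc_to_symbol <- data.frame(
  ucsc_id = c("uc007csi.1", "uc007csj.1", "uc007csk.1", "uc007aaa.1"),
  symbol  = c("GeneA", "GeneA", "GeneA", "GeneB"),
  stringsAsFactors = FALSE
)

# Option 1: keep only the first UCSC ID by dropping everything
# from the first comma onward, then merge as before.
first_only <- gene_ids
first_only$gene_short_name <- sub(",.*$", "", first_only$gene_short_name)
merged_first <- merge(first_only, ucsc_to_symbol,
                      by.x = "gene_short_name", by.y = "ucsc_id",
                      all.x = TRUE)

# Option 2: expand to one row per UCSC ID so each ID is matched separately.
# as.character() guards against the column having been read as a factor.
id_list <- strsplit(as.character(gene_ids$gene_short_name), ",", fixed = TRUE)
long <- data.frame(
  tracking_id = rep(gene_ids$tracking_id, lengths(id_list)),
  ucsc_id     = trimws(unlist(id_list)),
  stringsAsFactors = FALSE
)
merged_all <- merge(long, ucsc_to_symbol, by = "ucsc_id", all.x = TRUE)

# Collapse back to one row per XLOC with the unique symbols found.
# Note: the formula interface of aggregate() drops rows whose symbol is NA,
# so XLOCs with no match at all will not appear in this summary.
symbols_per_gene <- aggregate(
  symbol ~ tracking_id, data = merged_all,
  FUN = function(s) paste(unique(s), collapse = ",")
)
```

Option 1 is the simpler of the two, but it silently discards the other IDs. Option 2 keeps them all and makes it visible when IDs in one XLOC map to different symbols.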
