Hi there,
I am using the cummeRbund package in RStudio on my personal computer to analyze differentially expressed genes using cuffdiff output files from Galaxy.
When extracting the list of genes that are differentially expressed, I initially get only the XLOC names. To get the UCSC names, I do the following:
SigGenesData_ContCORT <- getSig(cuff_data, x = "Control", y = "CORT", alpha = 0.05, level = "genes")
DiffGenes_ContCORT <- getGenes(cuff_data, SigGenesData_ContCORT)
GeneIDs_ContCORT <- featureNames(DiffGenes_ContCORT)
But I still want the actual gene symbols, not just the UCSC names, so my workaround is to download a file from the UCSC site that lists gene symbols by known UCSC ID. I then import this list into RStudio and merge it with my GeneIDs_ContCORT data table using the UCSC ID column, which is common to both tables.
This works fine with the exception of one problem. Some genes have multiple UCSC IDs, so in my GeneIDs_ContCORT table a good number of genes have several names separated by commas. For example:
tracking_id gene_short_name
1 XLOC_000525 uc007csi.1,uc007csj.1,uc007csk.1
So when I merge the tables, R doesn't match this entry to any UCSC ID from the downloaded list and just gives me "NA", because it reads those 3 names as a single name and can't find a match.
Is there a way I can instruct R to remove all but the first value in the gene_short_name column so that it can be properly matched? I.e. desired output:
tracking_id gene_short_name
1 XLOC_000525 uc007csi.1
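One possible way to keep only the first ID, sketched in base R. The data frame `gene_ids` is a hypothetical stand-in for GeneIDs_ContCORT, built here from the example row above so the snippet runs on its own; it assumes the column is (or has been converted to) character.

```r
# Hypothetical stand-in for GeneIDs_ContCORT (the table from featureNames()).
gene_ids <- data.frame(
  tracking_id     = "XLOC_000525",
  gene_short_name = "uc007csi.1,uc007csj.1,uc007csk.1",
  stringsAsFactors = FALSE
)

# Remove the first comma and everything after it, leaving only the first ID.
# Entries without a comma are left unchanged.
gene_ids$gene_short_name <- sub(",.*$", "", gene_ids$gene_short_name)

print(gene_ids)
```

This discards the other IDs, so a gene whose first UCSC ID is missing from the downloaded list would still come back as NA after the merge.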
Alternatively, is there a way to merge the tables such that the multiple UCSC names in one cell are all read separately and matched in my merged table?
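A rough sketch of this second idea in base R: split each cell on the comma, give every UCSC ID its own row, then merge. The names `gene_ids`, `ucsc_lookup`, `ucsc_id` and `symbol` are assumptions made for illustration, and the symbols in the lookup table are placeholders, not real annotations.

```r
# Hypothetical stand-in for GeneIDs_ContCORT.
gene_ids <- data.frame(
  tracking_id     = "XLOC_000525",
  gene_short_name = "uc007csi.1,uc007csj.1,uc007csk.1",
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the table downloaded from UCSC.
# The symbols are placeholders; uc007csk.1 is left out on purpose.
ucsc_lookup <- data.frame(
  ucsc_id = c("uc007csi.1", "uc007csj.1"),
  symbol  = c("SymbolA", "SymbolA"),
  stringsAsFactors = FALSE
)

# Split each comma-separated cell into a vector of individual IDs.
id_list <- strsplit(gene_ids$gene_short_name, ",", fixed = TRUE)

# Build a "long" table: one row per (tracking_id, UCSC ID) pair.
long_ids <- data.frame(
  tracking_id = rep(gene_ids$tracking_id, lengths(id_list)),
  ucsc_id     = unlist(id_list),
  stringsAsFactors = FALSE
)

# Left join: keep every ID from long_ids, with NA where no symbol is found.
merged <- merge(long_ids, ucsc_lookup, by = "ucsc_id", all.x = TRUE)

print(merged)
```

The result has one row per UCSC ID rather than one row per gene, so it may need collapsing afterwards (for example with `unique(merged[, c("tracking_id", "symbol")])`) if a single row per XLOC is wanted.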
Any help would be appreciated.
Thanks