Identifying Genes

Question: Identifying Genes

I am very new to Galaxy. We have performed a comparative analysis between the transcriptomes of different samples, using Galaxy tools (Tophat, CuffDiff, etc.). My PI has compiled a list of all the genes differentially expressed between each set, each in a separate Excel sheet. So what I have is an Excel spreadsheet with a list (usually around 300 rows) of test id, gene id, and locus (ChrX:111111111-22222222222). So far, we have been identifying each gene individually, one at a time, by pasting the locus into the UCSC browser. This works, but is incredibly tedious. There has to be a better way in Galaxy. I have tried making BED files out of the loci, but so far I have been unable to identify genes using Galaxy. Can someone please explain how I can take my long list of loci and get gene names, IDs, function, and possibly some downstream comparative ontologies to begin analyzing? Like I said, I am very new to Galaxy and genomics. Thanks very much.

galaxy • 1.4k views

modified 5.0 years ago by Jennifer Hillman Jackson ♦ 25k • written 5.0 years ago by Loupe, Jacob M. • 10

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hi Jacob,

Using the tool "Get Data -> UCSC Main table browser", data can be retrieved directly using either gene symbols or locus positions. A good track to query is "UCSC Genes", if available for your genome. "RefSeq Genes" is another good choice. Really, any track in the group "Gene and Gene Prediction Tracks" is worth a look to see whether it fits what you are interested in, as the content can vary between genomes and even builds. The specifics can be reviewed at UCSC by clicking "describe table schema" (the button next to the "table" selection; start with the default table).

To search multiple gene symbols, enter the list in the form under "identifiers". To search multiple loci, enter the list under "region" (define regions). Both accept a text file, so either cut the relevant column out of the original file in Galaxy and download it as text (tabular), formatted as the UCSC form states, or export it as text from the Excel spreadsheet. 300 should be fine at once; I believe the limits are around 1000 per query for each of these.

At this point in the query, the extract would just pull basic data from the single primary table. To also pull out related information, change the "output format" to "selected fields from primary and related tables" and then click on "get output". The next form is where you can link in additional tables of data. The general idea is to add the table, then select the specific fields that you want to include. Again, any of these can be reviewed before the final query is made, either from the first main form using the "describe table schema" button or, once in that describe view, by clicking on related tables to navigate. When doing the query this way, the Table browser takes care of the relational joins for you, just as an SQL query would.
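As a rough illustration of preparing the loci list for the "region" (define regions) form, here is a minimal Python sketch. It assumes the loci are strings shaped like "ChrX:111-222", that the target genome uses UCSC-style names with a lowercase "chr" prefix, and that the form accepts three tab-separated columns (chrom, start, end); check the format the form itself states before relying on this. The function names `parse_locus` and `to_region_lines` are made up for the example.

```python
import re

# Matches strings such as "ChrX:111-222" or " chr1:1,000-2,000 ".
# The prefix is matched case-insensitively so "Chr", "CHR" and "chr" all work.
LOCUS_RE = re.compile(r"^\s*(chr)?([^:\s]+):([\d,]+)-([\d,]+)\s*$", re.IGNORECASE)


def parse_locus(text):
    """Return (chrom, start, end) with a lowercase 'chr' prefix.

    The rest of the chromosome name is left as typed, and the coordinates
    are passed through unchanged (no 0-based/1-based adjustment is made).
    """
    match = LOCUS_RE.match(text)
    if match is None:
        raise ValueError(f"not a locus: {text!r}")
    name = match.group(2)
    start = int(match.group(3).replace(",", ""))
    end = int(match.group(4).replace(",", ""))
    if start > end:
        raise ValueError(f"start is after end: {text!r}")
    return "chr" + name, start, end


def to_region_lines(loci):
    """Turn an iterable of locus strings into tab-separated region lines."""
    lines = []
    for locus in loci:
        if not locus.strip():
            continue  # skip blank rows from the spreadsheet export
        chrom, start, end = parse_locus(locus)
        lines.append(f"{chrom}\t{start}\t{end}")
    return lines


if __name__ == "__main__":
    # Example loci as they might be pasted from the locus column.
    for line in to_region_lines(["ChrX:100-200", "chr2:1,000-2,000"]):
        print(line)
```

Stray whitespace and Windows line endings from a spreadsheet export are stripped by the pattern, and anything that does not look like a locus raises an error rather than being passed along silently.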
For more help about using the UCSC table browser, the links below are good places to start, and for detailed questions about a specific piece of data that you cannot locate, the support team for the browser can almost certainly help. The Table browser is not your only option (flat text files and a MySQL database are available), but it is a web-based access point to the information, easily imported into Galaxy or downloaded for further analysis. There are also other types of queries possible, at UCSC and in Galaxy; this is just the most direct one I know of for your question and original data:

https://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html
https://genome.ucsc.edu/FAQ/FAQmaillist.html

One note: you have the locus position with a chromosome identifier in the format "Chr1" in your email. I am not sure if this was intentional or not - but you will need to format the identifiers to match those in the target reference genome, just as they were in the original analysis. In general, this would mean the format would be "chrX" instead (case matters). So, check and adjust the case and format to avoid problems; these really do have to be an exact match. The same is true for gene names/symbols - if something is missing, you can always search in the browser to see what the format is and adjust.

Also make sure that Excel does not output any hidden characters (line wraps) - stick with plain text cells for best results if you plan to output or use the data with external tools. You probably know most of this, but just in case, I wanted to point out where the gotchas could be. Even if using gene names for this, you may want to use the position later on, and identifiers in the correct format from the start are a good idea.

Hopefully this gets you started!

Jen
Galaxy team

--
Jennifer Hillman-Jackson
http://galaxyproject.org

written 5.0 years ago by Jennifer Hillman Jackson ♦ 25k