I have uploaded a database of genes from the UCSC genome browser (Hg38) to Galaxy, to identify the genes with the largest number of polymorphisms in paired-end sequences from a trio (for a Coursera project). I have two questions: 1) the name column in the uploaded dataset does not have standard gene names; how do I convert to standard gene names? 2) I have set the tools in my workflow to use Hg38 consensus, but it isn't available in the UCSC browser. Are the coordinates for Hg38 the same as for Hg38 consensus, or should I change the settings in my tools to Hg38? Thank you.
Hello,
For 1) - Which track and format did you extract from UCSC? Often the gene names for a track are stored in related tables. To find them, review the table schemas and set the output format to "selected fields from primary and related tables". After you submit, a list of related tables will be available to browse and select content from.
Alternatively, you might want to do the analysis first (using the original annotation data) and then link in the associated gene names at the end. Input a tabular file that contains the identifiers in your original annotation plus the new identifier you want to add, and use a tool like Text Manipulation > Join two files.
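If it helps to see what that join step does outside of Galaxy, here is a minimal Python sketch of the same idea. The function names, the column layout, and the identifiers in the example are all made up for illustration, and this is not how the Galaxy tool is implemented. It assumes the mapping file has the original identifier in column 1 and the gene name in column 2.

```python
import csv
import io


def read_tabular(text):
    """Parse tab-separated text into a list of rows, skipping empty lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def join_gene_names(annotation_rows, mapping_rows, id_col=0):
    """Append a gene name to each annotation row by matching on an identifier.

    Assumption: mapping rows are (identifier, gene_name) pairs.
    """
    # Build a lookup from the original identifier to the gene name.
    lookup = {row[0]: row[1] for row in mapping_rows if len(row) >= 2}
    joined = []
    for row in annotation_rows:
        # Keep rows without a match and mark them "NA" (a choice made in
        # this sketch so that nothing is dropped silently).
        joined.append(row + [lookup.get(row[id_col], "NA")])
    return joined


# Hypothetical data: placeholder identifiers, not real UCSC names.
annotation = read_tabular("ID_A\tchr1\t100\t200\nID_B\tchr2\t300\t400\n")
mapping = read_tabular("ID_A\tGENE1\n")
result = join_gene_names(annotation, mapping, id_col=0)
```

Here `result` holds the original rows with one extra column at the end, so the analysis output keeps its original identifiers alongside the added gene names.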
For 2) - The coordinates for the chromosomes the two builds have in common are exactly the same. This means you can use the hg38 canonical reference genome for analysis and view the results in the full hg38 UCSC browser. The output datasets from Galaxy analysis will already have the primary base genome hg38 assigned as the database metadata attribute. The assigned database for a dataset is the key used to link into UCSC's content/genome browser.
Thanks, Jen, Galaxy team