Kind of a naive question, but is the mm10 genome on Galaxy the same as the GRCm38.ERCC mouse genome?
Hello,
The mouse mm10 genome indexed at Galaxy Main (http://usegalaxy.org) is sourced from UCSC and is based on NCBI's GRCm38. These are the credits: http://genome.ucsc.edu/goldenPath/credits.html#mouse_credits
While GRCm38 from NCBI is technically the same build (in terms of sequence content), the sequence identifiers differ between the original at NCBI and what UCSC produces. The ERCC RNA spike-in control data is an extra layer added to base genomes available at certain sources (GEO and Ensembl host these, I believe, and perhaps others). The source mm10 from UCSC used at Galaxy Main does not include this content.
If you wish to use a different mouse genome version than what is available at Galaxy Main, you can use a local/cloud Galaxy with a genome added through a Data Manager (from any source), or you can try the Custom Genome feature at Galaxy Main. Just be aware that using such a large genome as a custom genome may create jobs that run out of memory.
It is important to use exactly the same reference genome version for all steps in an analysis; otherwise, unexpected results are likely.
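One rough way to check whether two reference FASTA files can be used interchangeably is to compare their sequence identifiers and lengths. The following is only a minimal sketch: the function names `fasta_lengths` and `compare_references` are hypothetical, and the toy sequences and headers are made up for illustration, not real genome content.

```python
import io


def fasta_lengths(handle):
    """Return {sequence_name: length} for a FASTA file-like object."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the identifier: the header text up to the first whitespace.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def compare_references(a, b):
    """Return names only in a, names only in b, and shared names whose lengths differ."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    mismatched = sorted(n for n in set(a) & set(b) if a[n] != b[n])
    return only_a, only_b, mismatched


# Toy inputs (assumptions for illustration): same bases, different naming styles,
# plus one extra sequence present in only one of the two files.
ucsc_style = io.StringIO(">chr1\nACGTACGT\n>chrM\nACGT\n")
other_style = io.StringIO(">1 some description\nACGTACGT\n>MT\nACGT\n>ERCC-00002\nACGTAC\n")

only_a, only_b, mismatched = compare_references(
    fasta_lengths(ucsc_style), fasta_lengths(other_style)
)
print("only in first:", only_a)
print("only in second:", only_b)
print("length mismatches:", mismatched)
```

For real files, the `io.StringIO` objects would be replaced with opened FASTA files. Any name reported as unique to one file is a sign that datasets mapped against one reference will not line up with the other without renaming.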
https://wiki.galaxyproject.org/Support#Reference_genomes https://wiki.galaxyproject.org/Support#Custom_reference_genome https://wiki.galaxyproject.org/BigPicture/Choices
Thanks, Jen, Galaxy team
And one should take into account that NCBI coordinates are 1-based while UCSC's are 0-based! http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
To be clear, in practical terms, the start coordinate format (0-based or 1-based) is dependent on the datatype of the dataset/file. This is independent of the underlying version of the reference genome.
FAQs: https://galaxyproject.org/support/
- Common datatypes explained https://galaxyproject.org/learn/datatypes/
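As an illustration of the datatype-dependent start coordinate described above: BED intervals use a 0-based start with an exclusive end, while GFF/GTF intervals use a 1-based start with an inclusive end. A minimal sketch of converting between the two follows; the function names are hypothetical and the coordinates are made-up examples.

```python
def bed_to_one_based(start, end):
    """Convert a 0-based, half-open interval (BED) to 1-based, closed (GFF/GTF)."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    # Only the start shifts; the exclusive 0-based end equals the inclusive 1-based end.
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert a 1-based, closed interval (GFF/GTF) to 0-based, half-open (BED)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


# The first 100 bases of a chromosome in each convention.
gff_interval = bed_to_one_based(0, 100)
bed_interval = one_based_to_bed(1, 100)
print(gff_interval, bed_interval)
```

Both intervals describe the same 100 bases. The conversion depends only on the file format, not on which reference genome build the coordinates refer to.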