GEMINI includes hardcoded database versions in its build procedure. GEMINI on Galaxy seems to conform to this; it's slightly confusing because the database build date in the appropriate dropdown for GEMINI load is stated as later than the GEMINI release (0.8.1). However, the underlying data looks to belong to that GEMINI release, so I assume this is just a reflection of when this GEMINI build was incorporated into Galaxy.
Hello,
I confirmed that the date associated with the GEMINI indexes is the date on which the data was retrieved.
This retrieval uses the same methods as the command-line methods, except that they are wrapped in a Galaxy data manager that does some extra formatting/manipulation of the metadata and internal database links. The name given is custom. Whatever version of the data was current at that time is what was indexed and included in Galaxy. I don't see a way to view the timestamps on the original files from the source, but I don't think the tool authors intended for that to be revealed.
Hope that helps! Jen, Galaxy team
Thanks Jen,
When you say that the name given is custom, are you saying that 0.8.1 doesn't actually correspond to GEMINI 0.8.1?
Also, since each version of GEMINI seems to hardcode the versions of each database, it seems that Galaxy must have retrieved whatever annotation source versions were listed in the scripts, since you state that Galaxy didn't modify the build scripts. For instance, from the GEMINI 0.8.1 source, make-dbsnp.sh, line 17: "tabix dbsnp.b141.20140813.vcf.gz". This seems to strongly indicate that dbSNP 141 was used, correct?
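To illustrate the mapping I mean, here is a minimal Python sketch that pulls the dbSNP build number and file date out of a build-script line. The regular expression and the function name dbsnp_versions are my own assumptions, based only on the single filename quoted above; they are not part of GEMINI or Galaxy, and other annotation sources may be named differently.

```python
import re

# Assumed naming convention, inferred from one example only:
#   dbsnp.b<build>.<YYYYMMDD>.vcf.gz
DBSNP_PATTERN = re.compile(r"dbsnp\.b(?P<build>\d+)\.(?P<date>\d{8})\.vcf\.gz")


def dbsnp_versions(script_text):
    """Return sorted, unique (build, date) pairs found in the script text."""
    found = {
        (int(match.group("build")), match.group("date"))
        for match in DBSNP_PATTERN.finditer(script_text)
    }
    return sorted(found)


# The line quoted from make-dbsnp.sh in the GEMINI 0.8.1 source.
line = "tabix dbsnp.b141.20140813.vcf.gz"
print(dbsnp_versions(line))  # [(141, '20140813')]
```

Running the same function over the full text of each build script would give a quick list of the hardcoded versions, provided the other filenames follow a similar pattern.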
Yes, you have mapped the data correctly.
By custom name I mean that whoever generated the index can add in whatever name they want. Ideally, this is descriptive of the source/version. But for this particular tool's indexes, the download date was used instead. We are discussing how to label data better (the "best" label to use can be non-trivial in some cases).
The issue comes down to two factors: 1) the extremely long labels that some external datasets (genomes and reference data) require to fully describe the source/version, and 2) the problems caused by changing labels once they are already published (this creates confusion for people already using that data). But we will figure something out. Individual genomes already have a full label (source/version) available; the list of genomes in the Upload tool is an example.
Thanks for bringing this up. For this tool in particular, we will incorporate the version into the labels of indexes created in the future, and we are considering changing the current index label to include the version.