Integrating Galaxy With A Relational Data Warehouse?


8.3 years ago by

Yury Bukhman • 140 wrote:

Hi,

We are planning to build a data warehouse for a research center that utilizes multiple high-throughput experimental platforms, e.g. plate-based HTS assays, microarrays of several different types, ChIP-seq, and RNA-seq. We have been thinking of managing the data in a relational database. Galaxy looks attractive to us for its workflow management and data provenance features, e.g. to keep track of how raw data are analyzed to produce normalized and summarized datasets and/or final sets of statistics such as p values.

We wonder how amenable Galaxy would be to integration with a relational data store. One possible scenario might be to have Galaxy import a dataset from a relational database, run a workflow, then submit the results back to the database with the associated history or a link to it. Another possibility is to forgo the relational database altogether and do all our data management within Galaxy.

Any thoughts? We don't have much experience with Galaxy and would appreciate insights from those who do.

Many thanks.
Yury

--
Yury V. Bukhman, Ph.D.
Associate Scientist, Bioinformatics
Great Lakes Bioenergy Research Center
University of Wisconsin - Madison
445 Henry Mall, Rm. 513
Madison, WI 53706, USA
Phone: 608-890-2680
Fax: 608-890-2427
Email: ybukhman@glbrc.wisc.edu

chip-seq • 966 views


modified 8.2 years ago • written 8.3 years ago by Yury Bukhman • 140

8.3 years ago by

James Taylor ♦ 470

United States

James Taylor ♦ 470 wrote:

Hi Yury,

This is certainly a reasonable possibility. You could have a Galaxy tool for submitting data to your database. I would imagine such a tool would produce a Galaxy dataset as output with whatever unique identifier is necessary to recover exactly that data from the database for another analysis.

I can only give you our experience from inside Galaxy. After initial analysis we made a decision to store all data in Galaxy as files on disk, with metadata (data about data, connections between datasets, workflows, et cetera) in a relational database. We feel this decision has worked well. For the scale of data we see, as well as the wide variety of different data types, a relational database did not, and still does not, seem practical to us.

-- jt

James Taylor
Assistant Professor
Department of Biology
Department of Mathematics & Computer Science
Emory University

written 8.3 years ago by James Taylor ♦ 470
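The split James describes, bulk data as files on disk with only metadata in a relational database, might look roughly like the sketch below. It is an illustration under stated assumptions, not Galaxy's actual schema or code: the `dataset` table, the `register_dataset` and `load_dataset` helpers, and the use of SQLite and SHA-256 checksums are all hypothetical choices made for the example.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def register_dataset(conn, store_dir, name, payload, parent_id=None):
    """Write payload to disk and record only its metadata in the database."""
    digest = hashlib.sha256(payload).hexdigest()
    path = Path(store_dir) / f"{digest}.dat"
    path.write_bytes(payload)  # bulk data stays in a flat file
    cur = conn.execute(
        "INSERT INTO dataset (name, path, sha256, size_bytes, parent_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (name, str(path), digest, len(payload), parent_id),
    )
    conn.commit()
    return cur.lastrowid  # identifier a later analysis can use to get the data back


def load_dataset(conn, dataset_id):
    """Recover exactly the bytes that were registered under dataset_id."""
    row = conn.execute(
        "SELECT path, sha256 FROM dataset WHERE id = ?", (dataset_id,)
    ).fetchone()
    if row is None:
        raise KeyError(dataset_id)
    data = Path(row[0]).read_bytes()
    if hashlib.sha256(data).hexdigest() != row[1]:
        raise ValueError("file on disk no longer matches the recorded checksum")
    return data


# Hypothetical metadata table; parent_id records which dataset this one was derived from.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " path TEXT NOT NULL,"
    " sha256 TEXT NOT NULL,"
    " size_bytes INTEGER NOT NULL,"
    " parent_id INTEGER REFERENCES dataset(id))"
)
store_dir = tempfile.mkdtemp()
raw_id = register_dataset(conn, store_dir, "raw reads", b"ACGT\nTTGA\n")
norm_id = register_dataset(conn, store_dir, "normalized", b"0.5\n0.7\n", parent_id=raw_id)
```

The database row stays small no matter how large the file is, and the `parent_id` column is one simple way to keep the "connections between datasets" queryable.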

8.2 years ago by

Yury Bukhman • 140 wrote:

Thank you, James, for your reply. I wonder if you could elaborate on why storing the bulk of the data in a relational database seems impractical, or point me to a document where this is discussed at greater length.

Yury

written 8.2 years ago by Yury Bukhman • 140

Good afternoon Yury:

Typical file sizes currently run to tens or hundreds of GB for most workflows. It isn't practical to try to stuff such large single entities into a database. It is much simpler to compute indexes into the file and store the indexes in the database. We do this all the time at the UCSC genome browser.

--Hiram

written 8.3 years ago by Hiram Clawson • 260
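As a rough illustration of "compute indexes into the file and store the indexes in the database", the sketch below records the byte offset and length of each line of a tab-separated file in SQLite, then uses that index to read a single record without scanning the file. This is not the UCSC genome browser's actual indexing scheme; the `line_index` table, the helper names, and the keying on the first column are assumptions made for the example.

```python
import os
import sqlite3
import tempfile


def build_index(conn, path):
    """Store the byte offset and length of every line, keyed by its first column."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS line_index ("
        " key TEXT PRIMARY KEY, offset INTEGER NOT NULL, length INTEGER NOT NULL)"
    )
    offset = 0
    with open(path, "rb") as fh:
        for line in fh:
            key = line.split(b"\t", 1)[0].decode("ascii")
            conn.execute(
                "INSERT INTO line_index VALUES (?, ?, ?)", (key, offset, len(line))
            )
            offset += len(line)
    conn.commit()


def fetch_record(conn, path, key):
    """Look up the offset in the database, then seek straight to the record."""
    row = conn.execute(
        "SELECT offset, length FROM line_index WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        return None
    with open(path, "rb") as fh:
        fh.seek(row[0])
        return fh.read(row[1]).decode("ascii").rstrip("\n")


# A tiny stand-in for a large write-once data file.
fd, data_path = tempfile.mkstemp(suffix=".tsv")
with os.fdopen(fd, "wb") as fh:
    fh.write(b"geneA\t12.5\ngeneB\t3.1\ngeneC\t8.0\n")

conn = sqlite3.connect(":memory:")
build_index(conn, data_path)
record = fetch_record(conn, data_path, "geneB")
```

Because the file is written once and never edited, the stored offsets stay valid, which is what makes this kind of external index cheap to maintain.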

Exactly. In addition, most relational databases are optimized for data that can change, but the access pattern for our raw data is write once. We can implement more efficient storage formats and indexes outside the database for this purpose.

-- jt

written 8.2 years ago by James Taylor ♦ 470

8.2 years ago by

Yury Bukhman • 140 wrote:

Thanks again for your comments. The points about huge file sizes and their "write once" nature are convincing. Are the indexes you are talking about already implemented in Galaxy? Is this how it supports its database-like join and subset operations?

What about summarized downstream data types, such as gene intensities, p values from statistical tests, etc.? Those would seem to be relatively low-volume and less immutable.

Suppose, as a simple example, I have a gene expression experiment with several samples (be they arrays or RNA-Seq runs), assigned to 2 treatments. I want to set up a workflow that would first summarize the data to get an expression value for each gene (or exon, or transcribed region) in each run, and then do t tests to discover those that are differentially expressed between the treatments. I'll need to support a project that would perform similarly designed experiments over and over again, e.g. with different cell lines and/or treatments. Although the raw data may remain as flat files in a Galaxy data library, wouldn't it make sense to store the summarized data and t test p values in a relational database?

Thanks.
Yury

written 8.2 years ago by Yury Bukhman • 140
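The scenario in this post, per-gene summarized values for two treatments followed by a t test, with both kept in a relational store, could be sketched as follows. Everything here is hypothetical: the `expression` and `t_test` tables, the sample values, and the choice of Welch's unequal-variance t statistic are assumptions for illustration, and nothing in it is a Galaxy feature. Only the t statistic is computed, since a p value needs the t distribution's CDF, which the Python standard library does not provide.

```python
import math
import sqlite3
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b)
    )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE expression (
        gene TEXT, sample TEXT, treatment TEXT, value REAL,
        PRIMARY KEY (gene, sample));
    CREATE TABLE t_test (gene TEXT PRIMARY KEY, t_stat REAL);
    """
)

# Made-up summarized expression values: three samples per treatment.
rows = [
    ("geneA", "s1", "control", 1.0), ("geneA", "s2", "control", 2.0),
    ("geneA", "s3", "control", 3.0), ("geneA", "s4", "treated", 4.0),
    ("geneA", "s5", "treated", 5.0), ("geneA", "s6", "treated", 6.0),
    ("geneB", "s1", "control", 5.0), ("geneB", "s2", "control", 5.5),
    ("geneB", "s3", "control", 4.5), ("geneB", "s4", "treated", 5.0),
    ("geneB", "s5", "treated", 4.5), ("geneB", "s6", "treated", 5.5),
]
conn.executemany("INSERT INTO expression VALUES (?, ?, ?, ?)", rows)

genes = [g for (g,) in conn.execute("SELECT DISTINCT gene FROM expression ORDER BY gene")]
for gene in genes:
    groups = {}
    for treatment, value in conn.execute(
        "SELECT treatment, value FROM expression WHERE gene = ?", (gene,)
    ):
        groups.setdefault(treatment, []).append(value)
    # Treated minus control, so a positive statistic means higher expression when treated.
    conn.execute(
        "INSERT INTO t_test VALUES (?, ?)",
        (gene, welch_t(groups["treated"], groups["control"])),
    )
conn.commit()

results = dict(conn.execute("SELECT gene, t_stat FROM t_test"))
```

In practice a statistics library would supply the p value (for example, SciPy's `scipy.stats.ttest_ind` with `equal_var=False` performs Welch's test), and the result table would gain a column for it.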

Hi Yury,

Thank you for your suggestions! I thought that you might be interested in this particular training demo, which has a workflow that uses statistically summarized data: http://main.g2.bx.psu.edu/u/aun1/p/mtdemo-mapping-cheek-reads

Best wishes,
Jen
Galaxy Team

--
Jennifer Jackson
http://usegalaxy.org

written 8.2 years ago by Jennifer Hillman Jackson ♦ 25k
