Question: Storing Compressed Data Files
Assaf Gordon (United States) wrote, 10.5 years ago:
Hello,

As users add more and more data files to our galaxy server, disk space becomes a problem...

What I'd ultimately like is to store the data files in some compressed manner (at least some of the textual files). How would you suggest doing that?

A common scenario is:
1. User uploads a big Fastq/solexa file (=> 1.2 GB)
2. FASTQ file converted to FASTA file (=> 0.6 GB)
3. FASTA file trimmed, clipped, stripped, etc. (=> 100 MB)
4. BLAT, Histograms and other reports (=> ~50 MB)

The first three data sets take about 1.9 GB of disk space, and aren't really needed by the user (as he/she is mostly interested in the resulting report files). Since these are textual files, they compress really well.

Currently, I store the FASTQ gzip'ed in galaxy, and my tools know how to read gzip'ed data. There are two shortcomings with this method:
1. datasets (green squares) of gzip'ed files don't display any data in the peek window
2. Other galaxy tools which require a FASTQ file as input can't read my file.

Perl has an I/O module (PerlIO::gzip) which makes reading gzipped files transparent to the rest of the program. I think python has something very similar (http://www.python.org/doc/lib/module-gzip.html).

If it's not too much to ask, would it be possible to add support for reading gzip'ed files? At least in the peek/preview window?

Comments are welcomed,
Thanks,
Gordon.
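The transparent reading described above can be sketched with Python's standard gzip module. This is only an illustration, not Galaxy code: the helper names open_text and peek are hypothetical, and the sketch assumes that checking the two gzip magic bytes is an acceptable way to tell compressed files from plain ones.

```python
import gzip
import itertools

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def open_text(path):
    """Open a plain or gzip'ed text file for reading (hypothetical helper).

    The choice is made from the file's leading bytes, not its extension.
    """
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == GZIP_MAGIC
    if is_gzip:
        # "rt" makes gzip.open return decoded text lines.
        return gzip.open(path, "rt")
    return open(path, "r")


def peek(path, n=5):
    """Return the first n lines of the file, without trailing newlines."""
    with open_text(path) as fh:
        return [line.rstrip("\n") for line in itertools.islice(fh, n)]
```

A tool that opens its input through a helper like open_text would accept both compressed and uncompressed FASTQ files, and only the first n lines are decompressed for a peek.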
Asim Siddiqui wrote, 10.5 years ago:
Hi Assaf,

An alternative approach is to use a binary file format specifically designed to store sequence data compactly. An example of that is Sequence Read Format (SRF). SRF has been incorporated into the Illumina and Helicos pipelines and will be available for the AB platform shortly. SRF includes support for compression using several schemes including ZLIB.

This thread has been captured on the genographia website and I've commented there. There is also a link to more information on SRF.

Note: in terms of implementation, there is a C version (most complete), a C++ prototype (with a complete C++ implementation coming soon) and an early Java implementation.

http://www.genographia.org/portal/topics/sequence-read-format-srf/galaxy-and-file-size-management
http://www.genographia.org/portal/topics/sequence-read-format-srf/sequence-read-format-srf

Asim
Greg Von Kuster wrote, 10.5 years ago:
Hello Assaf,

See the http://g2.trac.bx.psu.edu/wiki/PurgeHistoriesAndDatasets wiki. You may find it useful to configure the cleanup_datasets scripts in cron to remove "deleted" datasets from disk after a configured number of days. Let me know if you have any questions about this process.

The empty peek window can be corrected in a fairly easy way. Just add "gzip" as a new data type (see http://g2.trac.bx.psu.edu/wiki/AddingDatatypes). In your "GZIP" class, include a "display_peek()" method or a "make_html_table()" method that will display what you want for the "gzip" data type. A close example of what you may need is available in the Binseq() class in ~/lib/galaxy/datatypes/images.py.

As for built-in support for reading gzip'ed files, we'll certainly take this under consideration. Galaxy currently does support retrieving compressed files from external data sources (UCSC) as well as uploading them via the upload utility. However, they are currently decompressed on-the-fly. Allowing them to remain compressed would require tools to decompress them. We'll see if maybe this makes sense for some tools.
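A rough sketch of what the peek logic for such a "gzip" data type might look like is given below. It is deliberately standalone: the class GzipPeekSketch is hypothetical and does not inherit from Galaxy's real datatype base class, and the method signatures (taking a file path directly) are assumptions made for illustration, so they would need adapting to whatever the AddingDatatypes wiki actually requires.

```python
import gzip
import html
import itertools


class GzipPeekSketch:
    """Hypothetical stand-in for a Galaxy "gzip" datatype class.

    Not Galaxy's API: only the method names display_peek and
    make_html_table come from the suggestion above.
    """

    file_ext = "gzip"

    def __init__(self, max_lines=5):
        self.max_lines = max_lines

    def read_peek_lines(self, file_name):
        # Decompress only as much as the first few lines need.
        with gzip.open(file_name, "rt", errors="replace") as fh:
            head = itertools.islice(fh, self.max_lines)
            return [line.rstrip("\n") for line in head]

    def make_html_table(self, lines):
        # Escape each line so sequence headers cannot inject markup.
        rows = "".join(
            "<tr><td>%s</td></tr>" % html.escape(line) for line in lines
        )
        return "<table>%s</table>" % rows

    def display_peek(self, file_name):
        try:
            return self.make_html_table(self.read_peek_lines(file_name))
        except (OSError, EOFError):
            # Raised when the file is not valid gzip or is truncated.
            return "gzip'ed file (peek unavailable)"
```

With something along these lines, the history item would show the first few decompressed lines while the dataset itself stays gzip'ed on disk.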