Question: Handling Large Files In Galaxy
9.4 years ago by Brad Chapman, United States
Brad Chapman wrote:
Hi all;

I've recently gotten a local Galaxy install up and running for our group. We do a lot of short read sequencing analysis and are looking at Galaxy as a framework to present the data and custom analyses associated with it.

One of our main interests is scaling the presentation to large fastq and alignment files. Specifically, we have a case where we'd like to make a large ~300Gb alignment file available to users to query and retrieve sections of alignments corresponding to genomic coordinates. We have a custom C++ program that does this, and would like to plug it in through the tools interface. We'd ideally like to use the Library permissions interface to make this available to certain users.

Would anyone be able to offer some advice about the best way to handle this? The standard upload, history, analyze workflow would not be ideal since this large file would be copied around. We've brainstormed 3 different ways to approach this:

- Have "special" uploaded files which are actually symlinks to the original file and do not get copied. This looks relatively difficult on my initial assessment.
- Pass the logged-in user to the C++ program and embed the logic of finding the right file within the external tool. Here we would need some advice about whether it is possible to pass the current user through the tools interface.
- The hack solution: upload a file that is actually just a link reference to the desired file, and this file gets passed to the external tool. The tool can then read the tiny file, know what large file to access, and proceed from there. This would involve some new datatype integration to handle the hack.

I am still relatively uninitiated in the Galaxy way, so could use some advice on whether any of these solutions is more likely to work smoothly than the others. Generally, what sort of approach is Galaxy taking towards increasingly massive files? Is anyone else doing something similar?

Thanks for any thoughts,
Brad
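To make the third ("hack") option concrete, here is a minimal Python sketch of a wrapper that reads a tiny pointer file and resolves the large file it names before handing that path to the external program. Everything in it is an assumption for illustration: the function name resolve_pointer, the one-path-per-file pointer format, and the allowed_root check are hypothetical and are not part of Galaxy or of the C++ tool described above.

```python
import os
import tempfile


def resolve_pointer(pointer_path, allowed_root):
    """Return the real path of the large file named in a tiny pointer file.

    Hypothetical format: the first line of the pointer file is a filesystem path.
    The target must live under allowed_root so a pointer cannot name arbitrary files.
    """
    with open(pointer_path) as handle:
        target = handle.readline().strip()
    real_target = os.path.realpath(target)
    real_root = os.path.realpath(allowed_root)
    # commonpath equals the root only when the target sits inside it.
    if os.path.commonpath([real_target, real_root]) != real_root:
        raise ValueError("pointer escapes allowed root: %s" % target)
    if not os.path.isfile(real_target):
        raise FileNotFoundError(real_target)
    return real_target


# Usage sketch with throwaway files standing in for the real data.
with tempfile.TemporaryDirectory() as root:
    big = os.path.join(root, "alignments.dat")
    with open(big, "w") as handle:
        handle.write("placeholder for the large alignment file\n")
    pointer = os.path.join(root, "pointer.txt")
    with open(pointer, "w") as handle:
        handle.write(big + "\n")
    resolved = resolve_pointer(pointer, root)
```

The resolved path is what a wrapper would pass on the command line to the query program, so only the small pointer file is ever copied. The root check is a design choice of this sketch: without it, anyone able to upload a pointer file could direct the tool at any readable file on the server.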
galaxy • 1.8k views
9.4 years ago by Greg Von Kuster
Greg Von Kuster wrote:
Hello Brad,

Galaxy will already handle large files in the way that you describe if you upload the file(s) to a library, creating what we refer to as a library dataset. With library datasets, there is 1 file on disk, even if users "import the library dataset" into their history for their own analysis. When users do this, it may look like they have created their own copy of the file on disk, but they are really just working with a pointer to the single disk file.

If you do not associate any roles with the "access" permission on the library dataset, it is considered public, and anyone can access it. However, if you associate roles with the access permission on the dataset, a user must have every role associated with the access permission in order to access the dataset in the library. Galaxy performs checks to ensure that the roles associated with the access permission on library datasets do not result in the dataset becoming inaccessible to all users.

If you have questions about integrating your proprietary C++ tool, please refer to our wiki at http://g2.trac.bx.psu.edu/wiki/AddingTools.

Please don't hesitate to contact us / me with any additional questions as you work through this process, and we'll make sure you get all of the help you need for this work.

Thanks very much,
Greg Von Kuster
Galaxy Development Team
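The access rule described above can be sketched in a few lines of Python. This is only an illustration of the stated logic (no roles means public; otherwise the user needs every listed role), not Galaxy's actual implementation; the function names and the use of plain string sets for roles are assumptions.

```python
def can_access(user_roles, access_roles):
    """Apply the rule as described: public when no roles are set, else all roles required."""
    required = set(access_roles)
    if not required:
        return True  # no roles on the access permission: dataset is public
    return required.issubset(user_roles)


def accessible_to_someone(all_user_roles, access_roles):
    """Check that at least one user would still be able to reach the dataset."""
    return any(can_access(roles, access_roles) for roles in all_user_roles)


# Usage sketch with made-up role names.
users = {
    "alice": {"sequencing-lab", "project-x"},
    "bob": {"sequencing-lab"},
}
alice_ok = can_access(users["alice"], {"sequencing-lab", "project-x"})
bob_ok = can_access(users["bob"], {"sequencing-lab", "project-x"})
```

Note that adding roles narrows access under this rule: each extra role on the access permission is one more role a user must hold, which is why a check like accessible_to_someone is useful before saving a combination that nobody satisfies.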
than the appreciated. While (from others' experience) Galaxy _should_ be able to upload files that large, we've had some problems with our local installation too. Investigation didn't reveal any cause, so we put it down to the quality of our network.

You might want to look at the web server or proxy that you have in front of Galaxy. From memory, both Apache and nginx can be configured to impose file size limits, so that _may_ be the problem. In any event, you might want to configure your server to handle uploads and downloads directly as per <https://bitbucket.org/galaxy/galaxy-central/wiki/Config/ProductionServer>.

Finally, you can pass a URL to the upload dialog to get Galaxy to pull the file from an FTP server, for example, and that may prove more robust.

Paul Agapow (paul-michael.agapow@hpa.org.uk)
Bioinformatics, Centre for Infections, Health Protection Agency
written 7.5 years ago by Paul-Michael Agapow
Hi Paul,

These failures may be due to the fact that many browsers will simply fail to upload files > 2GB (although I know there are people out there who have successfully done it).

Tilahun, I'll echo Paul's suggestion to use the Production setup. There are also alternative options for getting data into Galaxy. For users, you can have an FTP server (or just a local directory on the server where they can place files for upload):

https://bitbucket.org/galaxy/galaxy-central/wiki/Config/UploadViaFTP

Or you can use data libraries and load directly off filesystems accessible to the server:

https://bitbucket.org/galaxy/galaxy-central/wiki/DataLibraries/UploadingFiles

--nate
written 7.5 years ago by Nate Coraor
Thank you Paul and Nate. We will try the options you suggested. One thing we found to work much faster (less than 5 minutes for most files) is to upload zipped data. Galaxy could upload and unzip the files without any problem, and no sequence data appears to be lost. Has anyone tried this before?

Thanks.
Tilahun
written 7.5 years ago by Tilahun Abebe
Yes, this is a standard feature. zip, gzip, and bzip2 are all supported. Only one file per archive at this time, however. --nate
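For anyone scripting a similar check outside Galaxy, the Python sketch below detects zip, gzip, and bzip2 files by their leading magic bytes and opens them for reading, rejecting zip archives that hold more than one file. It is not Galaxy's upload code; the function names are hypothetical, and only standard-library modules are used.

```python
import bz2
import gzip
import os
import tempfile
import zipfile

# Leading bytes that identify each supported format.
MAGIC = [(b"\x1f\x8b", "gzip"), (b"BZh", "bz2"), (b"PK\x03\x04", "zip")]


def sniff_compression(path):
    """Return 'gzip', 'bz2', 'zip', or None based on the first bytes of the file."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    for magic, name in MAGIC:
        if head.startswith(magic):
            return name
    return None


def open_decompressed(path):
    """Return a binary file object that yields the uncompressed content."""
    kind = sniff_compression(path)
    if kind == "gzip":
        return gzip.open(path, "rb")
    if kind == "bz2":
        return bz2.open(path, "rb")
    if kind == "zip":
        archive = zipfile.ZipFile(path)
        names = archive.namelist()
        if len(names) != 1:
            archive.close()
            raise ValueError("expected one file per archive, found %d" % len(names))
        return archive.open(names[0])
    return open(path, "rb")  # not compressed: read as-is


# Usage sketch: round-trip a tiny gzip-compressed FASTQ-like record.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "reads.fastq.gz")
    with gzip.open(sample, "wb") as out:
        out.write(b"@read1\nACGT\n+\nIIII\n")
    detected = sniff_compression(sample)
    with open_decompressed(sample) as stream:
        content = stream.read()
```

Sniffing the content rather than trusting the file extension means a renamed or extension-less upload is still handled, and returning a stream instead of the full bytes keeps memory use flat for large sequence files.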
written 7.5 years ago by Nate Coraor