Question: Running out of space
Anya.Nikolai wrote:

Hello,

I'm trying to run a large dataset and I'm wondering at what point I can delete earlier steps to clear up space for later steps.

For example, after I perform a trim, can I delete the original FASTQ files? After I run HISAT2, can I delete the trimmed files, and so on?

Thank you!

Tags: rna-seq, galaxy, account, quota
Jennifer Hillman Jackson wrote:

Hello,

Technically, the only data you need to retain are the datasets currently being used as inputs, or those you plan to use as inputs in downstream steps. Also consider loading the FASTQ data in compressed format, which reduces the quota space used. All of the newer tool wrappers accept compressed fastqsanger.gz inputs, but some of the older wrappers do not: for example, use HISAT2 and avoid TopHat.
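As an aside, if your reads are still uncompressed on disk, you can gzip them before uploading. A minimal Python sketch (the filename reads.fastq is a placeholder):

    # Compress a FASTQ file to fastq.gz before uploading to Galaxy.
    import gzip
    import shutil

    with open("reads.fastq", "rb") as src:
        with gzip.open("reads.fastq.gz", "wb") as dst:
            shutil.copyfileobj(src, dst)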

In the context of an analysis, deciding to remove data involves making sure that a step you think is complete is not only green (a "successful" job) but actually produced the best content. Sometimes it takes a few runs to tune parameters optimally, and decisions about upstream parameters often require scientific review of downstream summary or data-reduction results.

One strategy is to download all intermediate datasets and save them locally (in case you need them again), then permanently delete them in Galaxy to recover space. Deleting alone is not enough; the data must be permanently deleted, aka "purged", before the space is released. Make certain your downloads are complete before purging -- curl or wget is a good choice for larger data.
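If you script your Galaxy interactions, the same download-then-purge strategy can be sketched with the BioBlend Python client. This is a minimal sketch, assuming you have an API key; the server URL, history ID, and dataset ID below are placeholders:

    # Sketch: save a local copy of a dataset, then purge it from Galaxy
    # to reclaim quota space. Requires `pip install bioblend`.
    from bioblend.galaxy import GalaxyInstance

    gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

    history_id = "HISTORY_ID"  # placeholder
    dataset_id = "DATASET_ID"  # placeholder

    # Download first, and verify the copy is complete before purging.
    gi.datasets.download_dataset(dataset_id, file_path="backups/",
                                 use_default_filename=True)

    # Deleting alone does not free quota; purge=True permanently
    # removes the data ("purged" in Galaxy terms).
    gi.histories.delete_dataset(history_id, dataset_id, purge=True)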

You can also set up a workflow so that intermediate datasets are purged while it runs. Just be aware that if the final results are not what you want, you'll need to run the entire workflow again after making adjustments, and you won't have access to the intermediate datasets for review. This option is therefore usually a better fit for established workflows run on batches of data (often bundled into dataset collections); a scripted example follows below.
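As a rough illustration of the batch case, here is a hedged BioBlend sketch for invoking a saved workflow on a history dataset. All IDs are placeholders, and the purge-intermediates behavior itself is configured on the workflow in Galaxy rather than in this call:

    # Sketch: run a saved workflow against a dataset in a history.
    # Requires `pip install bioblend`; all IDs are placeholders.
    from bioblend.galaxy import GalaxyInstance

    gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

    # Map workflow input slot "0" to an existing history dataset (hda).
    inputs = {"0": {"src": "hda", "id": "INPUT_DATASET_ID"}}

    invocation = gi.workflows.invoke_workflow(
        workflow_id="WORKFLOW_ID",
        inputs=inputs,
        history_id="HISTORY_ID",  # where the results will appear
    )
    print(invocation["id"])  # use this ID to track the invocation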

It is also important to know that larger data might exceed Galaxy's processing resources, in particular on a public server. If that happens, you'll need to move to your own Galaxy instance; CloudMan is a common choice.

Details for all of the above can be found in the Galaxy support resources (https://galaxyproject.org/support/).

Thanks! Jen, Galaxy team
