Hello,
Technically, the only data you need to retain are datasets that are currently being used as inputs, or that you plan to use as inputs, in downstream steps. Also consider loading fastq data in a compressed format; compressed data counts less against your quota. All of the newer tool wrappers accept compressed fastqsanger.gz inputs, but some of the older wrappers do not. For example: use HISAT2 and avoid TopHat.
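If your reads are still uncompressed on your local machine, a minimal sketch like this (plain Python, with a hypothetical file name) gzips them before upload so they can be loaded directly as fastqsanger.gz:

```python
import gzip
import shutil
from pathlib import Path

def compress_fastq(path: str) -> Path:
    """Gzip a local fastq file before uploading, so it can be
    loaded into Galaxy as fastqsanger.gz and use less quota."""
    src = Path(path)
    dest = src.with_suffix(src.suffix + ".gz")  # e.g. reads.fastq -> reads.fastq.gz
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest

compress_fastq("reads.fastq")  # hypothetical file name
```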
In the context of an analysis, deciding to remove data means confirming that a step you think is complete is not only green (a "successful" job) but actually produced the content you need. Sometimes it takes a few runs to tune parameters optimally, and decisions about upstream parameters often require scientific review of downstream summary/data-reduction results.
One strategy is to download all intermediate datasets and save them locally (in case you need them again), then permanently delete them in Galaxy to recover space. Deleting alone is not enough; the data must be permanently deleted, aka "purged". Make certain downloads are complete before purging -- curl or wget are a good choice for larger data.
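If you prefer to script that download-then-purge step, here is a minimal sketch using BioBlend (the Python library for the Galaxy API). The server URL, API key, and IDs below are placeholders, and the exact calls should be checked against the BioBlend documentation for your version:

```python
from bioblend.galaxy import GalaxyInstance

# Placeholders: supply your own server URL and API key.
gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

HISTORY_ID = "your_history_id"   # hypothetical IDs for illustration
DATASET_ID = "your_dataset_id"

# 1. Download the dataset locally first, keeping Galaxy's filename.
#    The target directory must already exist.
gi.datasets.download_dataset(DATASET_ID, file_path="backups/",
                             use_default_filename=True)

# 2. Only after confirming the download is complete (e.g. compare
#    file sizes), delete AND purge the dataset.
gi.histories.delete_dataset(HISTORY_ID, DATASET_ID, purge=True)
```

The key detail is `purge=True`: a plain delete leaves the data on disk, and it still counts against your quota.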
You can also set up a workflow so that intermediate datasets are purged while it runs. Just be aware that if the final results are not what you want, you won't have the intermediate datasets for review and will need to rerun the entire workflow after making adjustments. For that reason, this option is usually a better choice when running established workflows on batches of data (often bundled into dataset collections).
It is also important to know that larger data might exceed Galaxy's processing resources, in particular when using a public server. If that happens, you'll need to move to your own Galaxy instance; CloudMan is a common choice.
Details for all of the above can be found in these resources:
Thanks! Jen, Galaxy team