Output dataset collection is empty

22 months ago by

i-guigon • 80 wrote:

Hello everyone

I'm currently developing a tool that takes an RData object and a variable number of tabular files as input, and returns another RData object and a variable number of tabular output files (as many as there are input files). Dataset collections seemed like a good fit, but I've run into some problems.

The input dataset collection works fine, but for the output I get an empty dataset collection in my history.

I've found a fair number of examples on the Web (like here), but some points are unclear to me.

I don't understand the difference between tool > outputs > data > discover_datasets and tool > outputs > collection > discover_datasets in the official documentation.

Concerning the "pattern" attribute, I've seen examples with

pattern="(?P&lt;name&gt;.*)"

pattern="(?P&lt;designation&gt;.+)\.report\.tsv"

so I don't know if I should use "name" or "designation"...
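To make the two patterns above concrete, here is a minimal sketch that only relies on Python's `re` module, on the assumption that the `pattern` attribute is a Python-style regular expression with named groups. The XML entities `&lt;` and `&gt;` are decoded to `<` and `>`, and `sample1.report.tsv` is a hypothetical file name. It shows which part of a file name each named group captures; what Galaxy then does with a group called `name` versus `designation` is defined by the tool XML schema and is not asserted here.

```python
import re

# Patterns quoted in the question, with XML entities (&lt; &gt;) decoded.
NAME_PATTERN = r"(?P<name>.*)"
DESIGNATION_PATTERN = r"(?P<designation>.+)\.report\.tsv"


def captured_groups(pattern, filename):
    """Return the named groups captured from filename, or None if no match."""
    match = re.match(pattern, filename)
    return match.groupdict() if match else None


# "sample1.report.tsv" and "sample1.csv" are hypothetical file names.
whole_name = captured_groups(NAME_PATTERN, "sample1.report.tsv")
stripped = captured_groups(DESIGNATION_PATTERN, "sample1.report.tsv")
no_match = captured_groups(DESIGNATION_PATTERN, "sample1.csv")

print(whole_name)  # the group spans the entire file name
print(stripped)    # the group stops before ".report.tsv"
print(no_match)    # None: the file name lacks the ".report.tsv" suffix
```

In both cases the group name is just a label attached to the captured substring; the regular expression around it decides which files match and how much of the file name is kept.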

I've also seen an interesting topic here. The examples work for me, but when I tried to apply them to my own code I got an empty dataset. Moreover, I would prefer to use a dataset collection: if I have 15 input files I'll have 15 output files as well, and I don't want to spam my history...

Here is my xml:

<tool id="significant_regions" name="Significant regions" version="0.01">
    <command><![CDATA[
        Rscript --vanilla ${__tool_directory__}/significant_regions.R
        --rdata ${rdata}
        --merged_windows_dir ${__new_file_path__}/merged_windows_dir
    ]]></command>
    <inputs>
        <param name="rdata" format="rdata" type="data" label="--rdata" help="Input RData file (required)" />
        <param name="collect_results" format="tabular" type="data_collection" collection_type="list" label="--results" help="A table or list of tables from differential analysis (required)" />
    </inputs>
    <outputs>
        <data format="rdata" name="outr" label="output.RData" />
        <collection type="list" label="${collect_results.name} merged windows collection" name="merged_windows_collection">
            <discover_datasets pattern="(?P&lt;designation&gt;.*)\.csv" directory="${__new_file_path__}/merged_windows_dir" visible="true" />
        </collection>
    </outputs>
</tool>

--merged_windows_dir is the directory in which the result files will be created. If I set this parameter to "merged_windows_dir", my results are stored in the working directory (galaxy/database/jobs_directory/000/XXX/working/merged_windows_dir), but since that is a temporary directory, it is deleted at the end of the job, and the job appears in red in the history because it cannot find the output file.

If I try with

--merged_windows_dir ${__new_file_path__}/merged_windows_dir

and

<discover_datasets [...] directory="${__new_file_path__}/merged_windows_dir"

My output files are correctly created (and still accessible after the job is finished) at galaxy/database/tmp/merged_windows_dir but I probably have a problem with the discover_datasets tag in the xml since the dataset is empty.

What did I do wrong?

Thank you for your help.

discover_datasets dataset collection galaxy • 907 views

modified 22 months ago • written 22 months ago by i-guigon • 80

22 months ago by

jmchilton ♦ 1.1k (United States) wrote:

I'd strongly encourage you to do this without the __new_file_path__ stuff: in various places that will be interpreted literally instead of being expanded. Your tool is given a clean working directory, so I would just create the files there. Doing it that way also works more cleanly and robustly across different deployment scenarios.

The following variant is something I think is closer to working in the abstract:

<tool id="significant_regions" name="Significant regions" version="0.01">
    <command><![CDATA[
        mkdir -p merged_windows_dir && 
        Rscript --vanilla ${__tool_directory__}/significant_regions.R
        --rdata ${rdata}
        --merged_windows_dir merged_windows_dir
    ]]></command>
    <inputs>
        <param name="rdata" format="rdata" type="data" label="--rdata" help="Input RData file (required)" />
        <param name="collect_results" format="tabular" type="data_collection" collection_type="list" label="--results" help="A table or list of tables from differential analysis (required)" />
    </inputs>
    <outputs>
        <data format="rdata" name="outr" label="output.RData" />
        <collection type="list" label="${collect_results.name} merged windows collection" name="merged_windows_collection">
            <discover_datasets pattern="(?P&lt;designation&gt;.*)\.csv" directory="merged_windows_dir" visible="true" />
        </collection>
    </outputs>
</tool>
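As a rough illustration of what the relative `directory` plus `pattern` combination is meant to express, the sketch below simulates the idea outside Galaxy: files are written into a `merged_windows_dir` folder inside a scratch working directory, and the ones whose names match the pattern are listed by their `designation` group. This is not Galaxy's implementation. The `discover` helper, the file names, and the use of a full-filename match are all assumptions made for the example.

```python
import os
import re
import tempfile

# Pattern from the tool XML above, with XML entities decoded.
PATTERN = re.compile(r"(?P<designation>.*)\.csv")


def discover(directory):
    """Map each captured designation to the path of the matching file.

    Hypothetical helper: it mimics the idea of scanning a directory with a
    named-group pattern, assuming the whole file name has to match.
    """
    found = {}
    for filename in sorted(os.listdir(directory)):
        match = PATTERN.fullmatch(filename)
        if match:
            found[match.group("designation")] = os.path.join(directory, filename)
    return found


# A temporary directory stands in for the job's working directory.
with tempfile.TemporaryDirectory() as working_dir:
    out_dir = os.path.join(working_dir, "merged_windows_dir")
    os.makedirs(out_dir, exist_ok=True)  # same role as `mkdir -p` in the command

    # Hypothetical output files; "notes.txt" should be ignored by the pattern.
    for filename in ("a_merged_windows.csv", "b_merged_windows.csv", "notes.txt"):
        with open(os.path.join(out_dir, filename), "w") as handle:
            handle.write("placeholder\n")

    designations = list(discover(out_dir))

print(designations)
```

The point of the sketch is only that the directory is resolved relative to the working directory and that the part of each file name before `.csv` becomes the label of the discovered element.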

Let me know if this helps.

written 22 months ago by jmchilton ♦ 1.1k

Thank you for your answer. Unfortunately it still doesn't work. The job fails and here is what I get:

Error in file(file, ifelse(append, "a", "w")) : 
  cannot open the connection
Calls: write.table -> file
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :
  cannot open file 'merged_windows_dir/dmso_vs_gw_corrige_copie_merged_windows.csv': No such file or directory
Execution halted

It cannot find the output files, and I think I understand why: they are created in the working directory (working/merged_windows_dir/ in galaxy/database/jobs_directory/000/XXX/), but the working directory is cleaned once the job is over, so my output files are deleted! How am I supposed to get them? Isn't there a way to copy them to a safe directory that will not be deleted and where the files remain accessible? I don't understand what's wrong with my code :(

By the way, would you mind explaining the "pattern" feature in <discover_datasets> to me? I've seen tons of examples but no clear explanation of what "name", "designation", "__name_and_ext__" and the like mean.

Thank you again for your answer.

written 22 months ago by i-guigon • 80

22 months ago by

i-guigon • 80 wrote:

I found valuable information here, in the "Number of Output datasets cannot be determined until tool run" section, which helped me (more or less) solve my problem and, in the process, understand the "pattern" feature.

I don't have a universe_wsgi.ini file, but I found the collect_outputs_from option in galaxy/config/galaxy.ini, where the line was commented out. After uncommenting it, I got a non-empty output dataset collection.

There is still something odd, though, and I don't know whether it's normal behaviour: I get all my output files twice, once in the dataset collection and once as a separate dataset. So with 3 input files I get a collection with 3 output files plus 3 standalone datasets, one for each of the same 3 output files. One of the reasons for using dataset collections was to avoid spamming the history with a lot of output files... Is it possible to fix that?

Thank you for your help.

written 22 months ago by i-guigon • 80
