Question: Documentation for <collection> tag?
3.4 years ago by mrals89 (United States)
mrals89 wrote:

Hello everyone,

I am looking for documentation on the collection tag, which I can see used in Galaxy's tests, but not in its wiki. Specifically, I'm looking for how to properly output a list of paired-end datasets (and filter unless list:paired input). Do I need nested collection tags? Any help or response would be appreciated.

admin galaxy tool_config • 1.2k views
modified 3.4 years ago by Bjoern Gruening • written 3.4 years ago by mrals89

3.4 years ago by Bjoern Gruening
Bjoern Gruening wrote:

Have a look at the various examples here:

written 3.4 years ago by Bjoern Gruening

Bjoern, thanks for your response. I have looked these examples over quite a bit. However, none include an example of producing list:paired output explicitly. Can you advise?

I have tried the obvious step of nesting the collection types in the output, e.g. with either list or list:paired as the parent collection:

    <collection name="paired_list_output" type="list" label="Subsampled ${on_string}">
      <collection name="pairs" type="paired">
        <data format="fastq"/>
        <data format="fastq"/>
      </collection>
    </collection>

This produces:

    "implicit_collections": [],
    "jobs": [],
    "output_collections": [],
    "outputs": []

Upon adding the "structured_like" and "inherit_format" attributes referencing the input list:paired collection, I notice that it captures the list level correctly: the following command produces the dataset keys and filenames, respectively. However, the second value should be a Python DatasetCollectionWrapper object instead of a DatasetFilenameWrapper object. As a result, I cannot access the "forward" or "reverse" aspects of this collection.

      #for $key in $paired_list_output.keys():
        echo $key;
        echo $paired_list_output[$key];
      #end for

I would really appreciate a concrete example (including all the "inherits_from" things) showing how to produce this output. For some tools, the correct list:paired output seems to be produced entirely from "implicit collections", if I understand their function correctly. When I also try producing a list:paired collection type as output, it is unclear how to access the individual pairs, and I receive key errors when using my_list_of_pairs.forward or my_list_of_pairs[$key].forward.

modified 3.4 years ago • written 3.4 years ago by mrals89

Also, is there a simple way to list the available methods or keys, rather than hunting around for them?

written 3.4 years ago by mrals89
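
One possible way to inspect a collection wrapper from inside a tool is sketched below. It is a debugging aid only, and it rests on two assumptions that the thread does not confirm: that the tool has a hypothetical `data_collection` parameter named `input_pair`, and that Python built-ins such as `dir()` are reachable from the Cheetah template in your Galaxy version. The `keys()` call is the same one used in the loop quoted earlier in the thread.

```xml
<command><![CDATA[
    ## Debugging sketch: "input_pair" is a hypothetical parameter name.
    ## Print the attribute names Python reports for the wrapper object.
    #for $attr in dir($input_pair)
      echo 'attribute: $attr';
    #end for
    ## Print the element keys of the collection.
    #for $key in $input_pair.keys()
      echo 'key: $key';
    #end for
]]></command>
```

The output lands in the job's standard output, so it is best removed again once the available names are known.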

A tool can explicitly produce a list or pair - but not a list of pairs yet. In your case - this should be really easy to implement and I will try to find some time to work on this.

My follow-up question, though: do you really want to do this? In my experience most tools should just be implemented to consume a single dataset, and the tool author should just allow the user to "map" the tool over the list of paired datasets to produce a new list of paired datasets. This is much more robust, easier for the tool author to implement, and allows Galaxy to exploit more parallelism naturally. It doesn't really work when the underlying tool needs to use all inputs simultaneously in the same job - e.g. if there is some sort of normalization that is going to occur across all files.

written 3.4 years ago by jmchilton
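
As a rough illustration of the "explicit list" case mentioned in the reply above, here is a minimal sketch of a tool that declares a flat list output shaped like its list input. The tool id, the parameter name `input_list`, and the `cp` command are all hypothetical placeholders; only the `structured_like` and `inherit_format` attributes and the `keys()` loop correspond to things discussed in the thread.

```xml
<tool id="copy_list_sketch" name="Copy list (sketch)" version="0.1.0">
  <command><![CDATA[
    ## One output element exists per input element, addressed by the same key.
    #for $key in $list_output.keys()
      cp '$input_list[$key]' '$list_output[$key]';
    #end for
  ]]></command>
  <inputs>
    <!-- "input_list" is a hypothetical name for a flat list collection input. -->
    <param name="input_list" type="data_collection" collection_type="list"
           format="fastq" label="Input list"/>
  </inputs>
  <outputs>
    <!-- structured_like copies the element identifiers from the input;
         inherit_format copies each element's format as well. -->
    <collection name="list_output" type="list" label="Copied ${on_string}"
                structured_like="input_list" inherit_format="true"/>
  </outputs>
</tool>
```

This covers a flat list only; as the reply says, an explicit list of pairs was not supported at the time of writing.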

Yes, you're absolutely right. However, I didn't have an example to use to demonstrate this "implicit mapping" functionality. Is this documented anywhere? After a few hours of exploring, I figured out that the correct way was to use a paired collection as input (Galaxy also allows users to supply list:paired there) and a paired output, with a certain combination of structured_like and other tag attributes to get things working. So it seems to me that Galaxy implicitly understands that, although the input and output are both explicitly paired in the XML, the tool can run with list:paired input: the tool is run on each pair and creates a list:paired output. I have seen examples of other conversions in the Galaxy functional tests (single -> paired, etc.) but couldn't find an explicit example of list:paired -> list:paired, when in reality what was needed was paired -> paired.

written 3.4 years ago by mrals89
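
The paired -> paired approach described in the comment above can be sketched roughly as follows. This is not the poster's actual tool, which the thread does not show: the tool id and parameter names are hypothetical, `seqtk sample` is used purely as a stand-in subsampler, and the sketch declares explicit forward/reverse children rather than the structured_like combination the poster mentions.

```xml
<tool id="subsample_pair_sketch" name="Subsample read pair (sketch)" version="0.1.0">
  <command><![CDATA[
    ## Stand-in command: the same seed is used for both mates so that the
    ## same records are kept in each file and the pairing is preserved.
    seqtk sample -s 11 '$input_pair.forward' 0.1 > '$output_pair.forward' &&
    seqtk sample -s 11 '$input_pair.reverse' 0.1 > '$output_pair.reverse'
  ]]></command>
  <inputs>
    <!-- The tool itself only ever sees a single pair. -->
    <param name="input_pair" type="data_collection" collection_type="paired"
           format="fastq" label="Paired-end reads"/>
  </inputs>
  <outputs>
    <collection name="output_pair" type="paired" label="Subsampled ${on_string}">
      <data name="forward" format="fastq"/>
      <data name="reverse" format="fastq"/>
    </collection>
  </outputs>
</tool>
```

When a user supplies a list:paired collection to the paired input, Galaxy runs one job per pair and gathers the results into an implicit list:paired collection, which is the behaviour described in this thread.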

Output collections aren't really documented anywhere - they are still very new. I will try to remedy this. I think of this presentation - - as the sort of canonical collections documentation, and it does demonstrate mapping a dataset operation over a list of paired datasets to produce a list of paired datasets - and mapping a paired operation over a list of paired datasets to produce a list of datasets. It is quite old, however, and doesn't cover this case of output collections. Improving collections documentation is high on my priority list - especially for tool authors.

Out of curiosity - would you mind sharing what it is your tool does? I would like to understand the use case better.

written 3.4 years ago by jmchilton

Gladly! That presentation was actually how I learned that collections were possible in tool configuration within Galaxy. The tool subsamples paired-end fastq files, such that randomly selected mates are always kept or removed together. This requires paired -> paired mapping, and would often be run on large list:paired collections in high-throughput scenarios. Reducing the input data sizes can be useful for QC of a large number of samples.

written 3.4 years ago by mrals89

Ah - okay. Yeah - that makes a lot of sense then - I think having the tool consume and produce dataset pairs is exactly right. Thanks for the update!

written 3.4 years ago by jmchilton

Thank you!

written 3.4 years ago by mrals89