Question: mixing collections and "multiple" input tools
20 months ago, jdv wrote:

I'm trying to modify a workflow to work on list collections. Most tools work as expected without change, but I'm running into difficulty with tools that have an input with the 'multiple="true"' attribute set. For these tools, the input in question is expanded to ALL datasets in the collection, which in my opinion is not the correct behavior.

For example, I use BWA-MEM to map reads against a contaminant database and then Peter Cock's "filter seq by mapping" tool to discard mapped reads. This tool takes a single FASTQ file but allows multiple BAM files to filter on. It works fine for single inputs, but when a dataset collection is used upstream, it maps each FASTQ file in the collection against all BAM outputs from the previous BWA-MEM step (instead of only the BAM file resulting from mapping that specific FASTQ file).

I have no problem modifying the filtering tool (I'm running on a local instance) but I'm not sure how to get things to work nicely together. Has anyone else run into this problem?

Thanks, Jeremy


Hi Jeremy,

What happens if you supply one FASTQ file but two single BAM files? Would you consider the result correct output? Just looking at the code, I don't see why the collection behavior should differ from the multiple-single-input one.

written 19 months ago by Bjoern Gruening


As I understand the tool (and have been using it), if you specify a single FASTQ and two BAM files (say the FASTQ file mapped against two different databases), the tool should discard (or keep, depending on settings) all reads with mappings in any of the BAM files. This has always worked as expected in my workflows with simple inputs.
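The "any of the BAM files" behavior described above can be sketched in a few lines of Python. This is only an illustration of the semantics, not the tool's actual code: the function name `filter_reads`, the `keep_mapped` flag, and the read IDs are all hypothetical, and each BAM file is reduced to a plain set of mapped read IDs.

```python
def filter_reads(read_ids, mapped_ids_per_bam, keep_mapped=False):
    """Return the read IDs that survive filtering against several mapping files.

    read_ids           -- read IDs from the single FASTQ input (hypothetical data)
    mapped_ids_per_bam -- one set of mapped read IDs per BAM file
    keep_mapped        -- False discards mapped reads, True keeps only mapped reads
    """
    # A read counts as "mapped" if it has a hit in at least one BAM file.
    mapped = set().union(*mapped_ids_per_bam)
    if keep_mapped:
        return [r for r in read_ids if r in mapped]
    return [r for r in read_ids if r not in mapped]


# One FASTQ mapped against two different databases (made-up read IDs).
reads = ["r1", "r2", "r3", "r4"]
hits_db1 = {"r1"}
hits_db2 = {"r3"}
clean = filter_reads(reads, [hits_db1, hits_db2])
```

Here `clean` contains only the reads with no hit in either database, which is the behavior I rely on with simple inputs.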

If I introduce list collections as input, I seem to run into problems. To give a barebones example, let's say I have two read files (raw_A.fq and raw_B.fq) from two different samples in a dataset collection. For each, I want to filter out reads mapping to two contaminant databases (db_X and db_Y). I take the dataset collection as input into two parallel BWA-MEM steps mapping each input against the two databases. The two resulting BAM files are fed into the filtering tool along with the original FASTQ.

Here is what the MWE looks like visually: [workflow image not included]

After the mapping steps, there are four BAM datasets in the history. I'll call them:

  • raw_A_db_X.bam
  • raw_A_db_Y.bam
  • raw_B_db_X.bam
  • raw_B_db_Y.bam

Then, as expected, two filtering jobs are run, one for each FASTQ in the collection. Here is the problem. The expected behavior is that (1) raw_A.fq is filtered against raw_A_db_X.bam and raw_A_db_Y.bam and that (2) raw_B.fq is filtered against raw_B_db_X.bam and raw_B_db_Y.bam. The observed behavior (by looking at the actual commands run) is that each FASTQ file is filtered against all four BAM files. It seems that whatever internal mechanism Galaxy uses to keep track of which datasets within generated collections go together is not playing nicely with the "multiple=true" attribute of the BAM input parameter in the filter tool.
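To make the mismatch concrete, here is a small Python sketch of the two pairing behaviors. It is a toy model under my own assumptions, not Galaxy's implementation: collections are represented as dicts keyed by a hypothetical element identifier (`raw_A`, `raw_B`), and the function names `expected_jobs` and `observed_jobs` are made up for illustration.

```python
# Toy model: each collection maps an element identifier to a dataset name.
fastq_collection = {"raw_A": "raw_A.fq", "raw_B": "raw_B.fq"}
bam_db_x = {"raw_A": "raw_A_db_X.bam", "raw_B": "raw_B_db_X.bam"}
bam_db_y = {"raw_A": "raw_A_db_Y.bam", "raw_B": "raw_B_db_Y.bam"}


def expected_jobs(fastqs, *bam_collections):
    """Pair each FASTQ only with the BAMs that share its element identifier."""
    return {
        fq: [bams[ident] for bams in bam_collections]
        for ident, fq in fastqs.items()
    }


def observed_jobs(fastqs, *bam_collections):
    """Flatten every BAM collection and hand the full list to each FASTQ job."""
    all_bams = sorted(b for bams in bam_collections for b in bams.values())
    return {fq: list(all_bams) for fq in fastqs.values()}


expected = expected_jobs(fastq_collection, bam_db_x, bam_db_y)
observed = observed_jobs(fastq_collection, bam_db_x, bam_db_y)
```

In this model `expected["raw_A.fq"]` holds only the two raw_A BAM files, while `observed["raw_A.fq"]` holds all four, which matches the commands I see being run.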

My temporary fix has been to hack the filtering tool wrapper to remove the "multiple=true" attribute. With a single input BAM allowed, everything works as expected. For instance, with the same raw_A.fq and raw_B.fq and a single database db_Z, I get two BAM outputs:

  • raw_A_db_Z.bam
  • raw_B_db_Z.bam

and during the filtering step, raw_A.fq is filtered against raw_A_db_Z.bam and raw_B.fq is filtered against raw_B_db_Z.bam (as expected). If I run the exact same workflow using the unmodified wrapper, each FASTQ is filtered against both BAM files.

Sorry for the long-winded reply, but I hope this helps clarify the issue I'm observing.

modified 19 months ago • written 19 months ago by jdv