mixing collections and "multiple" input tools

2.0 years ago by

jdv • 20

I'm trying to modify a workflow to work on list collections. Most tools work as expected without change, but I'm running into difficulty with tools that have an input with the 'multiple="true"' attribute set. For these tools, the input in question receives ALL datasets in the collection at once, which in my opinion is not the correct behavior.

For example, I use BWA-MEM to map reads against a contaminant database and then Peter Cock's "filter seq by mapping" tool to discard mapped reads. This tool takes a single FASTQ file but allows multiple BAM files to filter on. It works fine for single inputs, but when a dataset collection is used upstream, it maps each FASTQ file in the collection against all BAM outputs from the previous BWA-MEM step (instead of only the BAM file resulting from mapping that specific FASTQ file).

I have no problem modifying the filtering tool (I'm running on a local instance) but I'm not sure how to get things to work nicely together. Has anyone else run into this problem?

Thanks, Jeremy

collections • 726 views

written 2.0 years ago by jdv • 20

Hi Jeremy,

What happens if you supply one FASTQ file but two single BAM files? Would you consider the result correct output? Just looking at the code, I don't see why the collection behavior should differ from the behavior with multiple single inputs.

written 2.0 years ago by Bjoern Gruening ♦ 5.1k

Bjoern,

As I understand the tool (and have been using it), if you specify a single FASTQ and two BAM files (say the FASTQ file mapped against two different databases), the tool should discard (or keep, depending on settings) all reads with mappings in any of the BAM files. This has always worked as expected in my workflows with simple inputs.
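The filtering semantics described above can be sketched in a few lines of Python. This is only a toy model of the behavior as I understand it, not the tool's actual code: the function name `filter_reads`, the `keep_mapped` flag, and the read IDs are all hypothetical, and real BAM parsing is replaced by plain sets of mapped read IDs.

```python
def filter_reads(read_ids, mapped_id_sets, keep_mapped=False):
    """Toy model: a read counts as 'mapped' if it appears in ANY of the sets.

    read_ids       -- read names from the single FASTQ input (hypothetical data)
    mapped_id_sets -- one set of mapped read names per BAM input
    keep_mapped    -- False discards mapped reads, True keeps only mapped reads
    """
    # Union of the mapped reads across all BAM inputs.
    mapped = set().union(*mapped_id_sets)
    return [r for r in read_ids if (r in mapped) == keep_mapped]


reads = ["r1", "r2", "r3", "r4"]
hits_db1 = {"r1"}  # reads mapped against the first database
hits_db2 = {"r3"}  # reads mapped against the second database

print(filter_reads(reads, [hits_db1, hits_db2]))                    # ['r2', 'r4']
print(filter_reads(reads, [hits_db1, hits_db2], keep_mapped=True))  # ['r1', 'r3']
```

The point is that the multiple BAM inputs are combined as a union for one FASTQ file, which is why all of them should come from mapping that same FASTQ file.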

If I introduce list collections as input, I seem to run into problems. To give a barebones example, let's say I have two read files (raw_A.fq and raw_B.fq) from two different samples in a dataset collection. For each, I want to filter out reads mapping to two contaminant databases (db_X and db_Y). I take the dataset collection as input into two parallel BWA-MEM steps mapping each input against the two databases. The two resulting BAM files are fed into the filtering tool along with the original FASTQ.

Here is what the MWE looks like visually: [workflow diagram; image not preserved]

After the mapping steps, there are four BAM datasets in the history. I'll call them:

raw_A_db_X.bam
raw_A_db_Y.bam
raw_B_db_X.bam
raw_B_db_Y.bam

Then, as expected, two filtering jobs are run, one for each FASTQ in the collection. Here is the problem. The expected behavior is that (1) raw_A.fq is filtered against raw_A_db_X.bam and raw_A_db_Y.bam and that (2) raw_B.fq is filtered against raw_B_db_X.bam and raw_B_db_Y.bam. The observed behavior (by looking at the actual commands run) is that each FASTQ file is filtered against all four BAM files. It seems that whatever internal mechanism Galaxy uses to keep track of which datasets within generated collections go together is not playing nicely with the "multiple=true" attribute of the BAM input parameter in the filter tool.
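To make the difference concrete, here is a small Python sketch contrasting the expected pairing with the observed one. It is a toy model only and says nothing about how Galaxy implements collections internally: the dictionaries keyed by element identifier and the function names `expected_jobs` and `observed_jobs` are assumptions made for illustration.

```python
# Each collection is modeled as {element identifier: dataset name}.
fastq = {"raw_A": "raw_A.fq", "raw_B": "raw_B.fq"}
bam_x = {"raw_A": "raw_A_db_X.bam", "raw_B": "raw_B_db_X.bam"}
bam_y = {"raw_A": "raw_A_db_Y.bam", "raw_B": "raw_B_db_Y.bam"}


def expected_jobs(fastq, *bam_collections):
    """Expected: each FASTQ is paired only with BAMs sharing its identifier."""
    return {
        fq: [coll[ident] for coll in bam_collections]
        for ident, fq in fastq.items()
    }


def observed_jobs(fastq, *bam_collections):
    """Observed: every FASTQ receives every BAM from every collection."""
    all_bams = sorted(b for coll in bam_collections for b in coll.values())
    return {fq: list(all_bams) for fq in fastq.values()}


for fq, bams in expected_jobs(fastq, bam_x, bam_y).items():
    print("expected:", fq, "<-", bams)
for fq, bams in observed_jobs(fastq, bam_x, bam_y).items():
    print("observed:", fq, "<-", bams)
```

In the expected case each filtering job gets two BAM files; in the observed case each job gets all four.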

My temporary fix has been to hack the filtering tool wrapper to remove the "multiple=true" attribute. With a single input BAM allowed, everything works as expected. For instance, with the same raw_A.fq and raw_B.fq and a single database db_Z, I get two BAM outputs:

raw_A_db_Z.bam
raw_B_db_Z.bam

and during the filtering step, raw_A.fq is filtered against raw_A_db_Z.bam and raw_B.fq is filtered against raw_B_db_Z.bam (as expected). If I run the exact same workflow using the unmodified wrapper, each FASTQ is filtered against both BAM files.

Sorry for the long-winded reply, but I hope this helps clarify the issue I'm observing.

modified 2.0 years ago • written 2.0 years ago by jdv • 20
