Batch workflow with two different inputs each time?



Dear Galaxy Community,

I have installed Galaxy on our cluster, and I am now trying to design a workflow to process our data. However, I am facing a technical issue and would like your suggestions on how to solve it.

I have around 1,200 Fastq bs-seq datasets, which I want to align to a modified reference genome. Each dataset comes from a different individual, for which I have SNP information in VCF format. I would like my workflow to substitute each individual's SNPs into the reference genome (this can be done easily with bcftools consensus or the VCFtools vcf-consensus script), index this substituted genome, and then align the Fastq sequences to the substituted, indexed genome (with Bismark).

At first, this seemed pretty straightforward to me. However, to run this pipeline, I need two clicks for each individual: one to select the Fastq file and one to select the VCF file (and a third click to press "Execute", of course!). As I have 1,200 individuals (and will have more in the future), this is very laborious and error-prone.

What I would like is to somehow "link" the corresponding VCF and Fastq files for each individual, and then run the pipeline on several individuals at the same time using something like the "multiple datasets" option normally available with any tool.

Is there a way to do that? I initially thought this could be done using the "dataset collection" functionality, but from what I have read it only works with 2 files of the same type. Also, the VCF and Fastq files are not used in the same step of the workflow (nor by the same tool), which complicates things.

For information, my Fastq and VCF files are currently stored in data libraries in Galaxy.

I am open to any suggestions, and I thank you in advance for your help!

Sincerely

David

multiple inputs workflow alignment galaxy batch • 684 views


modified 19 months ago by jmchilton ♦ 1.1k • written 19 months ago by david.roquis • 30

This is an interesting use case that does not have a solution yet through the UI. It could possibly be solved by writing a script and making use of the API (do you have programming resources?). We are discussing this and will give more feedback soon, likely as a ticketed future enhancement idea for the UI implementation. Thanks! Jen, Galaxy team

modified 19 months ago • written 19 months ago by Jennifer Hillman Jackson ♦ 25k

Hi Jen and Galaxy Team,

Thank you for your very quick answer. I am a biologist with some IT skills, but I am not really good at scripting. That is one reason why I went for Galaxy, as the pipeline scripts left by our previous bioinformatician were not really flexible. I'll ask a friend whether he can help me with a script, but I would definitely be interested if there is a ticket for a future enhancement. I will post the script here if I manage to write it.

Sincerely

David

written 19 months ago by david.roquis • 30

19 months ago, jmchilton ♦ 1.1k (United States) wrote:

I think you are essentially right about the concept of "paired" lists in Galaxy: they are meant to be of the same datatype. That said, I think collections might still be the tool to use in general. Collections have the concept of order and identifiers for elements, which are meant to allow this sort of computation.

So if there are different individuals and they have some sort of participant identifier (pid0001, pid0002, pid0003, ..., pid1200), then you could in theory create two lists, one for VCF files and one for Fastq files, with matching identifiers in a matching order. While the GUI will let you do this, it would be tremendously onerous and error-prone to do by hand. I think you would want an API script or something to set these up. The next release of Galaxy, 17.05, will feature the ability to create collections from folders. So as long as these libraries are created in such a way that the VCFs are in one folder and the Fastq files are in another, and they are in the same order, this may serve as a way to create these collections in the near future (see https://github.com/galaxyproject/galaxy/pull/3559 for the implementation of this feature by Marius van den Beek).
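As a rough illustration of the "matching identifiers in a matching order" idea, here is a minimal Python sketch. It assumes every file name contains a participant ID such as pid0001; that naming pattern and the function name `match_by_identifier` are hypothetical and not part of Galaxy.

```python
import re

# Assumption: every file name embeds an ID such as "pid0001".
PID_PATTERN = re.compile(r"pid\d+")


def match_by_identifier(vcf_names, fastq_names, pattern=PID_PATTERN):
    """Return (ids, vcfs, fastqs, unmatched).

    ids, vcfs and fastqs are parallel lists sorted by participant ID;
    unmatched lists the IDs that appear in only one of the two inputs.
    """

    def index(names):
        found = {}
        for name in names:
            match = pattern.search(name)
            if match is None:
                raise ValueError(f"no participant ID in {name!r}")
            pid = match.group(0)
            if pid in found:
                raise ValueError(f"duplicate participant ID {pid!r}")
            found[pid] = name
        return found

    vcfs, fastqs = index(vcf_names), index(fastq_names)
    shared = sorted(vcfs.keys() & fastqs.keys())
    unmatched = sorted(vcfs.keys() ^ fastqs.keys())
    return shared, [vcfs[p] for p in shared], [fastqs[p] for p in shared], unmatched
```

Checking that `unmatched` is empty before building the two lists catches individuals that are missing either their VCF or their Fastq file, which is the kind of mistake that is easy to make by hand with 1,200 samples.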

Once you have these collections, you can supply them to a tool or to a workflow, and Galaxy will match up the elements by their order in the lists as it processes them, so things should "just work" the way you would like.
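For anyone scripting this, supplying the two lists to a workflow might look roughly like the sketch below. It assumes `gi` is a BioBlend `GalaxyInstance`, that the workflow has two collection inputs at step indices "0" and "1" (hypothetical; check your own workflow), and that the collection IDs come from your history. The helper names are made up for illustration.

```python
def build_workflow_inputs(vcf_collection_id, fastq_collection_id,
                          vcf_step="0", fastq_step="1"):
    """Map workflow input steps to two list collections.

    "hdca" is the source type Galaxy uses for a history dataset collection.
    The step indices are assumptions about the workflow layout.
    """
    return {
        vcf_step: {"src": "hdca", "id": vcf_collection_id},
        fastq_step: {"src": "hdca", "id": fastq_collection_id},
    }


def run_on_collections(gi, workflow_id, history_id,
                       vcf_collection_id, fastq_collection_id):
    """Invoke the workflow once with both collections as inputs."""
    inputs = build_workflow_inputs(vcf_collection_id, fastq_collection_id)
    return gi.workflows.invoke_workflow(
        workflow_id, inputs=inputs, history_id=history_id
    )
```

A single invocation replaces the per-individual clicking; the pairing of each VCF with its Fastq then depends entirely on the two lists having the same identifiers in the same order.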

So it would "just work" if you can create the correct two lists - but creating the correct two lists is kind of tough.

But... there is a hacky workaround to create those lists in a fairly robust way, if you are willing to cheat a bit. The paired list creator doesn't actually enforce that the things in the paired list have the same datatype. So you can use it, and its ability to match things based on regex, to create a list where each VCF is the "forward" element of a pair and the matching Fastq is the "reverse" element. You can then apply the "unzip" tool that is distributed with Galaxy to that paired list, and you will have two lists with homogeneous datatypes (VCF and Fastq respectively) that can be supplied to your workflow or to tools as needed. This terrible workaround is a modality of collection creation that Galaxy should directly support. I've created an issue to track this: https://github.com/galaxyproject/galaxy/issues/3916.
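The same trick could in principle be scripted instead of done in the paired list creator. The sketch below builds a description of a list of pairs with the VCF as "forward" and the Fastq as "reverse", in the nested dictionary form that I believe the Galaxy collections API accepts. The function names and dataset IDs are hypothetical, and `unzip` here is only a local illustration of what the Unzip Collection tool produces, not the tool itself.

```python
def paired_list_payload(name, pairs):
    """Describe a list:paired collection.

    pairs maps participant ID -> (vcf_dataset_id, fastq_dataset_id),
    where the IDs are history dataset ("hda") IDs.
    """
    return {
        "name": name,
        "collection_type": "list:paired",
        "element_identifiers": [
            {
                "name": pid,
                "src": "new_collection",
                "collection_type": "paired",
                "element_identifiers": [
                    {"name": "forward", "src": "hda", "id": vcf_id},
                    {"name": "reverse", "src": "hda", "id": fastq_id},
                ],
            }
            for pid, (vcf_id, fastq_id) in sorted(pairs.items())
        ],
    }


def unzip(payload):
    """Split a list:paired description into forward and reverse lists."""
    forward, reverse = [], []
    for pair in payload["element_identifiers"]:
        members = {e["name"]: e["id"] for e in pair["element_identifiers"]}
        forward.append((pair["name"], members["forward"]))
        reverse.append((pair["name"], members["reverse"]))
    return forward, reverse
```

With BioBlend, such a dictionary could presumably be passed to `gi.histories.create_dataset_collection(history_id, payload)`; treat that usage as an assumption to verify against your Galaxy and BioBlend versions. Because both output lists are derived from the same pairs, their identifiers and order agree by construction.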

Thanks for your interest and I hope something like this proves workable.

written 19 months ago by jmchilton ♦ 1.1k

Hi,

Thanks for this very detailed answer and this workaround! I will give it a try and post an update here!

Sincerely

David

written 19 months ago by david.roquis • 30

Hello again!

I have been trying to follow your suggestion to build my workflow, but I am stuck at one point. As you explained, I have created a list of paired datasets, with one "mate" being the VCF file and the other one being the corresponding Fastq file. However, while building my workflow, I have been unable to find the "unzip" tool you mention to split the list of paired datasets into two lists (one with all the forward VCF files and another with all the reverse Fastq files). May I ask where I could find it?

I guess that once I have this tool, I can simply put it between "input dataset collection" and both BCFtools and Bismark in my workflow. Sorry if my questions are a bit naive.

Thanks a lot

David

written 19 months ago by david.roquis • 30

Hi David, the tool to use is Collection Operations > Unzip Collection. Thanks! Jen, Galaxy team

written 19 months ago by Jennifer Hillman Jackson ♦ 25k

Thanks! For some reason, the Collection Operations tools were not present in my tool menu, but the problem is now fixed!

Sincerely

David

written 19 months ago by david.roquis • 30
