Dear Galaxy Community,
I have installed Galaxy on our cluster, and I am now trying to design a workflow to process our data. However, I am facing a "technical" issue and would like your suggestions on how to solve it.
I have around 1,200 Fastq BS-seq datasets, which I want to align to a modified reference genome. Each dataset comes from a different individual, for which I have SNP information in VCF format. I would like my workflow to substitute that individual's SNPs into the reference genome (this can be done easily with bcftools consensus or the vcf-consensus script from VCFtools), index the substituted genome, and then align the Fastq reads to the indexed, substituted genome (with Bismark).
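To make the three steps concrete, here is a minimal sketch of the per-individual commands I have in mind, written as a small Python helper that only builds the command lines without running them. The function name, the directory layout and the file names are hypothetical. I am assuming the VCF is bgzip-compressed and indexed (as bcftools consensus expects) and that the installed Bismark accepts the --genome option; both should be checked against the versions on your system.

```python
from pathlib import PurePosixPath


def build_commands(sample_id, reference_fasta, vcf_gz, fastq, work_dir):
    """Return the three command lines for one individual (nothing is executed).

    Hypothetical layout: <work_dir>/<sample_id>/genome holds the substituted
    genome and <work_dir>/<sample_id>/bismark receives the alignments.
    The genome directory must exist before the first command is run.
    """
    sample_dir = PurePosixPath(work_dir) / sample_id
    genome_dir = sample_dir / "genome"
    consensus_fasta = genome_dir / f"{sample_id}.fa"
    align_dir = sample_dir / "bismark"
    return [
        # 1. Substitute the individual's SNPs into the reference genome.
        ["bcftools", "consensus", "-f", str(reference_fasta),
         "-o", str(consensus_fasta), str(vcf_gz)],
        # 2. Build the bisulfite index for the substituted genome.
        ["bismark_genome_preparation", str(genome_dir)],
        # 3. Align the reads to the indexed, substituted genome.
        ["bismark", "--genome", str(genome_dir),
         "-o", str(align_dir), str(fastq)],
    ]


if __name__ == "__main__":
    for cmd in build_commands("ind0001", "ref.fa", "ind0001.vcf.gz",
                              "ind0001.fastq", "/scratch/bsseq"):
        print(" ".join(cmd))
```

Each inner list could be passed to subprocess.run outside Galaxy; inside Galaxy, each one would correspond to one tool step in the workflow.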
At first, this seemed pretty straightforward to me. However, to run this pipeline, I need two clicks for each individual: one to select the Fastq file and one to select the VCF file (and a third click to press "Execute", of course!). As I have 1,200 individuals (and will have more in the future), this is very laborious and error-prone.
What I would like is to somehow "link" together the corresponding VCF and Fastq files for each individual, and then run the pipeline on several individuals at the same time, using something like the "multiple datasets" option normally available with any tool.
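To illustrate the kind of "link" I mean, here is a minimal sketch in Python that pairs files by individual. It assumes (hypothetically) that the Fastq and VCF files of one individual share the same base name, for example ind0001.fastq.gz and ind0001.vcf.gz; the function names and paths are invented for the example and are not part of Galaxy.

```python
from pathlib import PurePosixPath

# File extensions stripped to recover the individual's ID (an assumption
# about naming; adjust to the real file names).
KNOWN_SUFFIXES = (".fastq.gz", ".fastq", ".fq.gz", ".fq", ".vcf.gz", ".vcf")


def sample_id(path):
    """Return the file name without its Fastq/VCF extension."""
    name = PurePosixPath(path).name
    for suffix in KNOWN_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    raise ValueError(f"unrecognised file type: {path}")


def pair_by_sample(fastq_paths, vcf_paths):
    """Pair each Fastq with the VCF of the same individual.

    Returns (paired, unpaired): a dict mapping individual ID to
    (fastq, vcf), and a sorted list of IDs missing one of the two files.
    """
    fastqs = {sample_id(p): p for p in fastq_paths}
    vcfs = {sample_id(p): p for p in vcf_paths}
    paired = {s: (fastqs[s], vcfs[s]) for s in sorted(fastqs.keys() & vcfs.keys())}
    unpaired = sorted(fastqs.keys() ^ vcfs.keys())
    return paired, unpaired


if __name__ == "__main__":
    paired, unpaired = pair_by_sample(
        ["lib/ind0001.fastq.gz", "lib/ind0002.fastq.gz"],
        ["vcf/ind0001.vcf.gz"],
    )
    print(paired)
    print("missing a partner:", unpaired)
```

Reporting the unpaired IDs separately is deliberate: with 1,200 individuals, a missing or misnamed file is exactly the kind of error I would like to catch before launching anything.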
Is there a way to do that? I initially thought this could be done using the "dataset collection" functionality, but from what I have read it only works with two files of the same type. Also, the VCF and Fastq files are not used at the same step of the workflow (nor by the same tool), so I do not see how a single paired input could feed both steps.
For information, my Fastq and VCF files are (at the moment) stored in data libraries in Galaxy.
I am open to any suggestions, and I thank you in advance for your help!