Hello,
It might be better to merge the BAM files after alignment and filtering; otherwise, the job could get too large to map in one go.
And be sure to watch your quota usage if you are using http://usegalaxy.org. I am not sure how large each of these datasets is, but a 250G account quota and 200 input datasets could be a problem. You may need to run the first steps in batches anyway, preserving just the results and purging intermediate datasets as you go along. If you are using a local or cloud Galaxy that you administer, with more data storage, then this is not an issue.
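To make the quota point concrete, here is a rough back-of-the-envelope sketch in Python. Only the 250G quota and the 200 inputs come from the question; the function names, the free space, and the per-job output size are hypothetical placeholders that you would replace with numbers from your own history.

```python
def average_budget_gb(quota_gb: float, n_datasets: int) -> float:
    """Storage available per dataset if the quota were split evenly."""
    return quota_gb / n_datasets


def max_pairs_per_batch(free_gb: float, gb_per_pair_job: float) -> int:
    """Largest number of paired mapping jobs whose outputs fit in the free space.

    gb_per_pair_job is an estimate you supply (hypothetical here): the total
    size of everything one mapping job writes before intermediates are purged.
    """
    if gb_per_pair_job <= 0:
        raise ValueError("gb_per_pair_job must be positive")
    return int(free_gb // gb_per_pair_job)


if __name__ == "__main__":
    # 250 GB quota spread over 200 inputs, before any outputs are written.
    print(average_budget_gb(250, 200))
    # Placeholder values: 50 GB free, roughly 4 GB written per mapping job.
    print(max_pairs_per_batch(50, 4))
```

If the inputs alone average more than the first number, they will not all fit at once, and the second number gives a starting point for how many pairs to run before purging intermediates.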
On the BWA tool form, you can select multiple BAM datasets, and the tool will run a mapping job on each pair individually. I would not attempt to do all 200 in the same run. Maybe batch these into smaller groups. Just make certain that the paired BAM sequence inputs are in the same order when selecting them for each batch run. You want the paired ends in two files for each BWA run (do not merge both ends of a paired dataset together before mapping - the tool expects two sequence input datasets per mapping job).
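One way to keep the two ends in matching order and split them into smaller groups is to prepare the lists outside Galaxy first. This is only a sketch: the `<sample>_R1.bam` / `<sample>_R2.bam` naming convention, the function names, and the batch size of 10 are assumptions for illustration, not anything Galaxy or BWA requires.

```python
from typing import List, Sequence, Tuple


def sample_key(name: str) -> str:
    """Sample part of a file name (hypothetical '<sample>_R1.bam' convention)."""
    for tag in ("_R1", "_R2"):
        if tag in name:
            return name.split(tag)[0]
    raise ValueError(f"no read tag in {name!r}")


def pair_in_order(forward: Sequence[str], reverse: Sequence[str]) -> List[Tuple[str, str]]:
    """Sort both ends by sample and check that they line up one-to-one."""
    fwd = sorted(forward, key=sample_key)
    rev = sorted(reverse, key=sample_key)
    if [sample_key(f) for f in fwd] != [sample_key(r) for r in rev]:
        raise ValueError("forward and reverse lists do not describe the same samples")
    return list(zip(fwd, rev))


def batches(items: Sequence, size: int) -> List[list]:
    """Split items into consecutive groups of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


if __name__ == "__main__":
    forward = [f"s{i:03d}_R1.bam" for i in range(100)]
    reverse = [f"s{i:03d}_R2.bam" for i in reversed(range(100))]  # deliberately out of order
    groups = batches(pair_in_order(forward, reverse), 10)
    print(len(groups), groups[0][0])
```

Each group then lists, in the same order, which forward and reverse datasets to select for one batch run.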
After that, you will most likely want to do the filter steps (each can also be done in batch, on each of the 100 mapping results, but again, maybe not all 100 in one go). You can do this all in one history or use multiple histories, putting final results in the same history at the end (leaving intermediate datasets behind) - use the History menu function "Copy Datasets".
Then, before the variant calling step (correct?), is the time to merge the read-group-labeled, mapped, and filtered BAM files all together into one.
Now - this step is where it can get tricky until Dataset Collections are fully implemented (as John responded earlier to a prior question about this workflow). Using the API instead of the UI will help considerably (as Dan responded earlier). For any tool where you want all of the inputs to be used in the same job, do NOT use "Multiple Datasets", as this will run the job on each input individually. If you are using the UI, all of the inputs need to be entered on the form, and this will be tedious to do for 100 inputs. You could batch this as well and then merge the group results, but that is for you to decide. I am not exactly sure how 100 inputs will function in the UI tool form.
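For the API route, the same distinction shows up in how the tool inputs are written. The sketch below only builds the input dictionaries. It assumes the `{"src": "hda", "id": ...}` dataset reference and the `{"batch": True, "values": [...]}` form as I understand them from more recent Galaxy releases, so please check both against your server's version. The parameter name `input_bams` and the dataset ids are hypothetical; each tool defines its own parameter names.

```python
from typing import Dict, List


def hda(dataset_id: str) -> Dict[str, str]:
    """Reference to one history dataset (assumed tools-API form)."""
    return {"src": "hda", "id": dataset_id}


def single_job_inputs(param: str, dataset_ids: List[str]) -> dict:
    """All datasets are handed to ONE job, which is what a merge step needs."""
    return {param: [hda(d) for d in dataset_ids]}


def per_dataset_inputs(param: str, dataset_ids: List[str]) -> dict:
    """One job PER dataset, the counterpart of "Multiple Datasets" in the UI."""
    return {param: {"batch": True, "values": [hda(d) for d in dataset_ids]}}


if __name__ == "__main__":
    ids = ["id_a", "id_b", "id_c"]  # placeholder dataset ids
    print(single_job_inputs("input_bams", ids))
    print(per_dataset_inputs("input_bams", ids))
```

With BioBlend, a dictionary like this would be passed as the `tool_inputs` argument of `gi.tools.run_tool(history_id, tool_id, tool_inputs)`, where `gi` is a `GalaxyInstance` created with your server URL and API key.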
Hopefully this helps. The punch line is that the API is best for high-throughput work right now. And even after Dataset Collections are implemented, the API will still be a good way to run workflows with so many inputs.
Others from our team and community may offer more advice (welcome!).
Jen, Galaxy team