Question: Merging Multiple Unaligned BAM Files
gkuffel22 wrote, 3.6 years ago (United States):

Hi everyone,

I am developing a workflow in Galaxy for variant analysis and have hit yet another roadblock. I have 200 samples with paired-end FASTQ files. I have managed to groom and trim all of the samples using a workflow, and I have converted each FASTQ file to a BAM file (manually, in order to add the sample name and read group info). My next problem is that I would like to merge all of the BAM files into one BAM file and then use BWA to align the data. I am trying to use BAMTools, and although this tool does take multiple files as input, instead of combining all of the files into one, it attempts to merge each dataset, for a total of 200 merges with an output of 100 files. Is there a way of doing this using the multiple-files input, or do I have to perform this step by inputting the files one by one manually?


Tags: bamtools • workflow • galaxy
Jennifer Hillman Jackson wrote, 3.6 years ago (United States):


It might be better to merge the BAM files after alignment and filtering, otherwise the job could get too large to map in one go.

Also be sure to watch your quota usage. I am not sure how large each of these datasets is, but a 250 GB account quota and 200 input datasets could be a problem. You may need to run the first steps in batches anyway, preserving just the results and purging intermediate datasets as you go along. If you are using a local or cloud Galaxy that you administer with more data storage, then this is not an issue.

On the BWA tool form, you can select multiple BAM datasets and the tool will run a separate mapping job on each pair. I would not attempt to do all 200 in the same run; batch these into smaller groups. Just make certain that the paired BAM sequence inputs are in the same order when selecting them for each batch run. You want the paired ends in two files for each BWA run (do not merge both ends of a paired dataset together before mapping - the tool expects two sequence input datasets per mapping job).
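The batching advice above can be sketched in a few lines of Python: pair up the forward/reverse files by their shared sample prefix, then split the ordered pairs into small groups. The `_R1`/`_R2` naming scheme and the filenames are assumptions for illustration - adjust the matching rule to your own naming convention.

```python
# Sketch only: pair forward/reverse read files by sample name, keeping both
# ends in the same order, then split into small batches for separate runs.
# The "_R1"/"_R2" suffix convention is an assumption for this example.

def pair_reads(filenames):
    """Match forward/reverse files by sorting each end's filenames."""
    r1 = sorted(f for f in filenames if "_R1" in f)
    r2 = sorted(f for f in filenames if "_R2" in f)
    if len(r1) != len(r2):
        raise ValueError("unmatched forward/reverse files")
    return list(zip(r1, r2))

def batches(pairs, size):
    """Split the ordered pairs into groups of at most `size` pairs."""
    return [pairs[i:i + size] for i in range(0, len(pairs), size)]

files = ["s2_R1.fq", "s1_R2.fq", "s1_R1.fq", "s2_R2.fq"]  # hypothetical names
pairs = pair_reads(files)
groups = batches(pairs, 1)  # two batches of one pair each
```

Because both ends are sorted the same way, each batch keeps the R1 and R2 inputs in matching order, which is exactly the constraint the tool form imposes.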

After that, you will most likely want to do the filtering steps (each can also be done in batch on the 100 mapping results, but again, maybe not all 100 in one go). You can do this all in one history or use multiple histories, putting the final results in the same history at the end (leaving intermediate datasets behind) - use the History menu function "Copy Datasets".

Then, before the variant calling step (correct?), is the time to merge the read-group-labeled, mapped, and filtered BAM files together into one.
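For reference, outside of Galaxy this many-into-one merge is commonly done with `samtools merge` (the read groups added earlier are what preserve per-sample identity in the merged file). The sketch below only builds the command line; the file paths are hypothetical, and actually running it requires samtools to be installed.

```python
# Sketch, assuming samtools is available: build a `samtools merge` command
# for the filtered, read-group-labeled BAMs. Paths are hypothetical.
import subprocess

def merge_command(output_bam, input_bams):
    """Classic samtools usage: output file first, then the inputs."""
    return ["samtools", "merge", output_bam] + list(input_bams)

cmd = merge_command("all_samples.bam", ["s1.bam", "s2.bam", "s3.bam"])
# subprocess.run(cmd, check=True)  # uncomment where samtools is installed
```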

Now - this step is where it can get tricky until Dataset Collections are fully implemented (as John responded earlier to a prior question about this workflow). Using the API instead of the UI will help considerably (as Dan responded earlier). For any tool where you want all of the inputs used in the same job, do NOT use "Multiple Datasets" - that will run the job on each input individually. If you are using the UI, all of the inputs need to be entered on the form, and this will be tedious for 100 inputs. You could batch this as well and then merge the group results, but that is for you to decide. I am not exactly sure how 100 inputs will behave in the UI tool form.
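The API route can be scripted with BioBlend, the Python client for the Galaxy API (`pip install bioblend`). The server URL, API key, and all of the ids below are placeholders to fill in from your own instance - this is a sketch of the pattern, not a drop-in script.

```python
# Sketch, assuming BioBlend is installed and you have an API key.
# Placeholders: server URL, key, history/tool/dataset ids.
# from bioblend.galaxy import GalaxyInstance
# gi = GalaxyInstance(url="https://your.galaxy.server", key="YOUR_API_KEY")

def tool_payload(dataset_id):
    """Inputs payload for one history dataset ("hda"), in the Galaxy
    API's {"src": ..., "id": ...} format."""
    return {"input": {"src": "hda", "id": dataset_id}}

# One job per input dataset - no clicking through 100 UI forms:
dataset_ids = ["d1", "d2"]  # placeholder dataset ids
payloads = [tool_payload(d) for d in dataset_ids]
# for p in payloads:
#     gi.tools.run_tool(history_id, tool_id, p)  # ids from your instance
```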

Hopefully this helps. The punch line is that the API is best for high-throughput work right now, and even after Dataset Collections are implemented, it will still be a good way to run workflows with this many inputs.

Others from our team and community may offer more advice (welcomed!).

Jen, Galaxy team



Jen, just want to say thank you. Your answers are so useful and so detailed. Always worth a read!

- Bjoern Gruening, 3.6 years ago


Powered by Biostar version 16.09