Question: Multiple input for samtools mpileup (-b option) for GBS pipeline
2.7 years ago by
fiebig30 wrote:

Hello, I'm working on a GBS analysis workflow that was previously designed for command-line usage. It should analyse more than 50 samples. Steps:

  1. Trimming of FASTQ input
  2. Mapping => multiple sorted and indexed BAM files, one per sample

My problem starts here:

I want to use the BAM files as input for variant detection with samtools mpileup. On the command line, I simply write the paths of all BAM files created in the previous steps to a list file and pass that file to mpileup using the -b bamlist.txt option. This gives me a single VCF file containing information for multiple samples. I completely failed to reproduce this result in Galaxy. The only way I found is to specify every BAM file "by hand" in the mpileup interface. This produces the desired file. With two files I got the following log output:

[mpileup] 2 samples in 2 input files
<mpileup> Set max per-file depth to 4000

But it is not practicable for more than 10 samples...
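For reference, the command-line approach described above can be sketched roughly as follows. This is only an illustration, not the poster's actual script: the function names `write_bam_list` and `mpileup_command`, the file name `bamlist.txt`, and the reference path are hypothetical. Only the `-b` option comes from the post itself; `-f` is the usual samtools option for the reference FASTA, and output-format options are left out because they differ between samtools versions.

```python
from pathlib import Path


def write_bam_list(bam_dir, list_path, pattern="*.bam"):
    """Write the absolute paths of all BAM files in bam_dir to list_path, one per line."""
    bams = sorted(str(p.resolve()) for p in Path(bam_dir).glob(pattern))
    # One path per line, as expected by the -b option of samtools mpileup.
    Path(list_path).write_text("".join(b + "\n" for b in bams))
    return bams


def mpileup_command(list_path, reference_fasta):
    """Build (but do not run) the argument list for a multi-sample mpileup call."""
    # Hypothetical minimal call: output options are omitted on purpose.
    return ["samtools", "mpileup", "-f", str(reference_fasta), "-b", str(list_path)]
```

The returned list could be passed to `subprocess.run` on a machine where samtools is installed; the sketch itself only prepares the file list and the arguments.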

So far, I have tried Galaxy's "multiple input" option as well as a "data list collection". Still, every BAM input is treated as a single input file, resulting in one VCF per BAM instead of one VCF covering multiple samples.

Has anybody here run into the same problem and knows the trick? Is there a way to pass BAM input to mpileup dynamically?

Any help would be appreciated. Maybe I have misunderstood the problem. I have the same trouble when trying to merge BAM files...

Best regards, Anne

modified 2.7 years ago by Jennifer Hillman Jackson25k • written 2.7 years ago by fiebig30
2.7 years ago by
United States
Jennifer Hillman Jackson25k wrote:


I suggest running each BAM dataset individually through this tool (using the multiple-dataset or collection input option). Next, merge the resulting VCF files using "VCFsort" followed by "VCFcombine".

Once you have a working analysis path, consider placing the tools into a workflow for re-use.

Best, Jen, Galaxy team


Ok, I gave it a try and it basically does the job. There is only one small difference in the results file between mpileup -b and vcfsort: if a sample has no variants at a given position, I previously got "0,0,0". vcfsort sets it to "." instead, a minor issue that can be fixed easily.
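One possible way to apply that fix is sketched below. It is not taken from the thread: the function name `fill_missing_samples` is hypothetical, and the sketch assumes that the FORMAT column holds a single three-valued field, so that a sample column consisting only of "." can be replaced by "0,0,0". With a multi-field FORMAT the replacement would have to be adapted.

```python
def fill_missing_samples(vcf_line, placeholder="0,0,0"):
    """Replace sample columns that are exactly '.' with a placeholder value."""
    if vcf_line.startswith("#"):
        # Header lines are passed through unchanged.
        return vcf_line
    newline = "\n" if vcf_line.endswith("\n") else ""
    fields = vcf_line.rstrip("\n").split("\t")
    # The first nine VCF columns (CHROM .. FORMAT) are fixed; samples start at column 10.
    fixed, samples = fields[:9], fields[9:]
    samples = [placeholder if s == "." else s for s in samples]
    return "\t".join(fixed + samples) + newline
```

Only the sample columns are touched, so a "." in the ID or FILTER column stays as it is.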

Thanks a lot for your helpful suggestion!

written 2.7 years ago by fiebig30

Update: The tool has been modified. If read groups (@RG) are included in the input BAM datasets, multiple inputs will result in a combined output.

From the tool form:

What it does

Report variants for one or multiple BAM files. Alignment records are grouped by sample identifiers in @RG header lines. If sample identifiers are absent, each input file is regarded as one sample.
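To check which sample identifiers a BAM dataset carries, the @RG lines of its header can be inspected. The sketch below is an illustration only: the function name `samples_from_header` is hypothetical, and it assumes the header is available as plain text (for example from `samtools view -H`). It collects the SM tags, which are the sample identifiers in @RG lines.

```python
def samples_from_header(header_text):
    """Return the distinct SM (sample) values found in @RG header lines, in order."""
    samples = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        tags = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        sample = tags.get("SM")
        if sample is not None and sample not in samples:
            samples.append(sample)
    return samples
```

An empty result means the header has no sample identifiers, which is the case where each input file is regarded as one sample.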

written 9 months ago by Jennifer Hillman Jackson25k
