The FreeBayes variant calling tool generates a single VCF file for all selected BAM files (or for all BAM files in a chosen collection). This is counterintuitive compared with most Galaxy tools. It is a problem because all variants are stored in one VCF with no identification of the sample presenting each variant. Is there a way to run FreeBayes individually for each BAM in a collection? Or is there a way to identify the sample using FreeBayes?
Yes, this is how the Freebayes tool form works right now, as it accepts multiple BAM files as an input type. You could open an enhancement request against the tool repository to expand functionality when using a Dataset Collection (find the repository link in the Tool Shed).
Many tools will include per-BAM read group information in the output, including Freebayes. Add that to your BAM inputs or modify your workflow to add this in during mapping. Or consider adding in tags from the start.
Tutorials that cover Variant Analysis and Collections+Tags: https://galaxyproject.org/learn/
Thanks, Jen, Galaxy team
FreeBayes, like GATK, is a joint variant caller: it uses information from multiple samples to assess the likelihood that a real variant exists. That is its strength. If you run it sample by sample, you will likely get a lot of missing information. Suppose that at position 100 sample A is 0/0 and sample B is 0/1. When run singly, position 100 will not exist in the sample A .vcf, but it will exist in the sample B .vcf. If you run them jointly, data from both will exist: POS 100 = 0/0 and 0/1.
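To make the sample A / sample B case concrete, here is a toy Python sketch. It does not run FreeBayes and it is not how FreeBayes decides what to emit; the two helper functions are hypothetical and only mimic the behaviour described above, where a site is reported only when at least one of the samples being called is non-reference.

```python
# Toy genotypes for the example above (POS 100); not real FreeBayes output.
genotypes = {"A": {100: "0/0"}, "B": {100: "0/1"}}


def single_sample_records(sample_gts):
    """Mimic per-sample calling: keep a site only if this sample is non-reference."""
    return {pos: gt for pos, gt in sample_gts.items() if gt not in ("0/0", "./.")}


def joint_records(all_gts):
    """Mimic joint calling: keep a site if any sample is non-reference,
    and report the genotype of every sample at that site."""
    positions = set()
    for sample_gts in all_gts.values():
        positions.update(single_sample_records(sample_gts))
    return {
        pos: {name: all_gts[name].get(pos, "./.") for name in sorted(all_gts)}
        for pos in sorted(positions)
    }


single = {name: single_sample_records(gts) for name, gts in genotypes.items()}
joint = joint_records(genotypes)

print(single)  # sample A has no record at POS 100, sample B does
print(joint)   # POS 100 carries both 0/0 (A) and 0/1 (B)
```

In the single-sample results there is no way to tell whether A is 0/0 at position 100 or simply had no coverage there, which is the missing information mentioned above.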
You will need to think about this before you run everything individually. You may want to give FreeBayes a list of sites that you want it to output information for.
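One way to prepare such a list of sites is as a BED file. The sketch below assumes the sites are known as (chromosome, 1-based position) pairs; the function name and the example coordinates are made up for illustration. The FreeBayes command line has a --targets option that takes a BED file, but check how the Galaxy tool form exposes region limits before relying on this.

```python
def sites_to_bed(sites):
    """Convert (chrom, 1-based position) pairs into BED text.

    BED intervals are 0-based and end-exclusive, so a single site at
    1-based position p becomes the interval [p - 1, p).
    """
    lines = []
    for chrom, pos in sorted(set(sites)):
        lines.append(f"{chrom}\t{pos - 1}\t{pos}")
    return "\n".join(lines) + "\n"


# Hypothetical sites of interest.
bed_text = sites_to_bed([("chr1", 100), ("chr1", 250), ("chr2", 42)])
print(bed_text, end="")
```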
It is very easy to run files individually: just go to the tool, click on the tab that looks like a pile of papers, and select the files you want from the list. Each file will be run individually.
Identifying each sample in a VCF could be tough, but Jen's idea of using tags is a good one, though it could get involved if you have hundreds of samples in a VCF.
Jen and Guy, thanks for your answers. Guy, I agree with you, and I understand the risks of running FreeBayes individually, but my goal was identifying the sample(s) where each sample comes from (the BAM file or so). I know that samples may also appear as columns within the VCF, but FreeBayes shows only one column. Besides, adding a column per sample would perhaps be inefficient when most of the variants (from tumors) are sample-specific.
So my only concern is to identify the BAM(s)/sample(s) that showed each variant position, for validation and post-processing purposes (to avoid searching all BAMs individually). At first glance, I can't figure out Jen's idea of using tags. Jen, could you be more specific? Do you mean using tags in each BAM, or hashtags? If so, I don't see how these would propagate into the content of the VCF file to identify samples per variant using FreeBayes. My understanding is that Galaxy tags propagate between files rather than into the content of files. So, honestly, I did not grasp Jen's idea. I am willing to try, but I need more information from Jen.
Another approach could be to modify FreeBayes to include this information in some way (perhaps using an INFO field?). I'm willing to make this modification to the FreeBayes code (I guess it is available). Do you think it is worth it?
Hoping to get your feedback,
If I understand what you want to do, the tag idea will not be helpful. Also, the concern about missing data in my original reply probably does not apply.
The dataset tag was one idea to help you group inputs and outputs within the history. It is a convenience feature for navigating a long list of datasets (and one or more collections) and their relationships. Dataset/collection tags do not modify the internal content of the result dataset(s).
The utility of tags when running each BAM individually is up to you to decide. With hundreds of inputs it is likely tedious, but it is an option. See the help I linked for how-to.
Identifying source BAMs within the VCF output itself (when entering combined inputs) using read group information, if the tool is not already producing the information you want (some is present when read groups are defined per BAM), would be a distinct enhancement, with distinct utility, to the tool wrapper itself.
But please test out Guy's suggestions about read groups first. This seems like the best option with the current tool implementation, and it may be enough.
I guess the key part of your message is "my goal was identifying the sample(s) where each sample comes from (the BAM file or so)".
If I understand correctly, what about modifying the SM: field in the BAM file headers so that each BAM has a unique SM: value? For example, an individual X with 3 BAMs would have SM:x.1, SM:x.2 and SM:x.3.
Then the FreeBayes output .vcf will have an individual column for each input BAM, whether you call the BAMs jointly or separately.
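Once each BAM has its own column, listing which samples carry each variant is a small post-processing step. Here is a minimal Python sketch, assuming a plain-text VCF with a GT entry in the FORMAT column and sample names x.1, x.2 and x.3 as in the example above. The function name and the two records are made up for illustration, and for real data a dedicated VCF library would be more robust.

```python
def carriers_per_variant(vcf_text):
    """Map (chrom, pos, ref, alt) to the sample names with a non-reference genotype."""
    samples = []
    result = {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")
        carriers = []
        for name, sample_field in zip(samples, fields[9:]):
            gt = sample_field.split(":")[gt_index]
            alleles = gt.replace("|", "/").split("/")
            # Any allele other than reference (0) or missing (.) counts as a carrier.
            if any(allele not in ("0", ".") for allele in alleles):
                carriers.append(name)
        result[(chrom, int(pos), ref, alt)] = carriers
    return result


# Made-up records with one column per BAM, named after its SM: value.
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "x.1", "x.2", "x.3"]),
    "\t".join(["chr1", "100", ".", "A", "G", "50", ".", ".",
               "GT:DP", "0/0:20", "0/1:18", "./.:0"]),
    "\t".join(["chr1", "250", ".", "C", "T", "60", ".", ".",
               "GT:DP", "1/1:25", "0/0:30", "0/1:12"]),
])

carriers = carriers_per_variant(example_vcf)
for variant, names in carriers.items():
    print(variant, names)
```

This gives, per variant, the list of BAMs worth opening for validation, without searching every BAM individually.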
You can look at the BAM header with "Convert, Merge, Randomize BAM datasets and perform other transformations" > header option.
You can edit BAM headers with "AddOrReplaceReadGroups add or replaces read group information". The tool also gives a very good explanation of what SM and RG are and why they matter.
Both tools are available on Usegalaxy.
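To show what the header edit amounts to, here is a small Python sketch that works on the header as plain SAM text. It is only an illustration of the change, not a replacement for AddOrReplaceReadGroups; the function name and the example header are made up.

```python
def set_sample_name(sam_header, new_sm):
    """Return the header text with the SM: field of every @RG line set to new_sm."""
    out = []
    for line in sam_header.splitlines():
        if line.startswith("@RG"):
            # Drop any existing SM: field, then append the new one.
            fields = [f for f in line.split("\t") if not f.startswith("SM:")]
            fields.append(f"SM:{new_sm}")
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out)


# Made-up header for the first BAM of individual X.
header = "@HD\tVN:1.6\tSO:coordinate\n@RG\tID:rg1\tSM:x\tLB:lib1"
new_header = set_sample_name(header, "x.1")
print(new_header)
```

Only the @RG header line changes. The reads point to their read group through its ID, so they do not need to be rewritten for the new sample name to take effect.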
If you wanted to change the FreeBayes code, then maybe you could simply substitute the sample name in the .vcf with the BAM RG:, which presumably should be unique to each BAM.
Did I understand what you wanted to do?