I have posted this question before and did not receive any responses, so let me try to articulate the issue more clearly this time. I have 200 samples (one from each of 200 different coyotes) and I am trying to detect SNPs in the MHC gene. I have created a workflow in Galaxy and would like to use it to automate all of the required steps. At the end, I would like all of the data compiled into one VCF file, listed by sample name. The problem is that I cannot find a way to have each sample keep its unique sample name through the workflow without manually entering it at some point. Any ideas?
I presume you have targeted sequencing data (e.g., 200 FASTQ datasets) for the 200 individuals. If this is correct, one approach is to create a single unaligned BAM file that combines the data for all coyotes, with each individual labelled using read groups (see http://bit.ly/1H1v48z for an explanation of read groups). This can be done within or outside of Galaxy.
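To avoid typing 200 sample names by hand, the read group for each individual can be derived from its file name. Below is a minimal Python sketch of that idea, not a Galaxy tool. It assumes a hypothetical naming scheme in which each file is called something like `coyote_001.fastq.gz` and the part before the extension is the sample name; the function names and the `ILLUMINA` platform default are also my assumptions.

```python
from pathlib import Path


def sample_name(fastq_path):
    """Derive a sample name from a FASTQ file name.

    Assumes (hypothetically) that the file name minus its
    .fastq/.fq/.gz extensions is the sample name.
    """
    name = Path(fastq_path).name
    # Strip compression suffix first, then the FASTQ suffix.
    for suffix in (".gz", ".fastq", ".fq"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name


def read_group_line(fastq_path, platform="ILLUMINA"):
    """Build a tab-separated @RG header line for one sample.

    ID identifies the read group, SM carries the sample name
    that downstream variant callers report in the VCF.
    """
    sm = sample_name(fastq_path)
    fields = ["@RG", f"ID:{sm}", f"SM:{sm}", f"LB:{sm}", f"PL:{platform}"]
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for path in ["data/coyote_001.fastq.gz", "data/coyote_002.fastq.gz"]:
        print(read_group_line(path))
```

The `SM` field is the one that matters for your question, since that is the value that ends up as the column name for each individual in the final VCF.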
Next, you will align the reads in the unaligned BAM using BWA (do you have a coyote genome?). BWA accepts unaligned BAM files as input and will generate an aligned BAM that retains the read group data. You will then feed this BAM dataset to a variant caller such as FreeBayes, and the output will be a single VCF file with all 200 samples called individually, each identified by its read group sample name.
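Once you have the VCF, it is worth checking that every individual made it through with its name intact. Here is a small Python sketch for that, assuming an uncompressed VCF in the standard layout, where the `#CHROM` header line has nine fixed columns followed by one column per sample. The function name and the file name in the usage example are hypothetical.

```python
def vcf_sample_names(lines):
    """Return the sample names from the #CHROM header line of a VCF.

    `lines` is any iterable of text lines (e.g. an open file).
    The first nine columns (#CHROM .. FORMAT) are fixed by the
    VCF format; everything after them is a sample column.
    """
    for line in lines:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            return fields[9:]
    # No column header found: not a valid multi-sample VCF.
    return []


if __name__ == "__main__":
    # "coyotes.vcf" is a placeholder for your FreeBayes output.
    with open("coyotes.vcf") as handle:
        names = vcf_sample_names(handle)
    print(len(names), "samples found")
```

If the count is not 200, or the names do not match what you expect, the read groups were most likely lost or merged at an earlier step.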
However, one of the key questions is which reference genome you would like to align to and call variants against.