I have posted this question before and did not receive any responses, so let me try to articulate the issue more effectively this time. I have 200 samples (specifically from 200 different coyotes) and I am trying to detect SNPs in the MHC gene. I have created a workflow in Galaxy and would like to use it to automate all of the required steps. At the end I would like all of the data compiled into one VCF file, listed by sample name. The problem is that I cannot find a way to have each sample maintain its unique sample name through the workflow without manually entering it at some point. Any ideas?
I presume you have targeted sequencing data (e.g., 200 fastq datasets) for the 200 individuals. If this is correct, one approach would be to create an unaligned BAM file combining the data for all coyotes, with individuals labelled using read groups (see http://bit.ly/1H1v48z for an explanation of read groups). This can be done within or outside of Galaxy.
Next, you will align the reads within the unaligned BAM using BWA (do you have a coyote genome?). BWA accepts unaligned BAM files as input and will generate an aligned BAM with the read group data preserved. You will then feed this BAM dataset to a variant caller such as FreeBayes, and the output will be a VCF file with all 200 samples called individually.
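If you end up doing this outside of Galaxy, a rough sketch of the same idea is below. The file names, the reference name (dla_drb1.fa) and the fastq naming convention (sample_5_R1.fastq / sample_5_R2.fastq) are placeholders, and instead of going through an unaligned BAM this sketch attaches the read group directly at alignment time with bwa mem -R; FreeBayes is then run jointly over all the per-sample BAMs to produce a single multi-sample VCF.

#!/usr/bin/env python
# Sketch only: assumes bwa, samtools and freebayes are on PATH, and that the
# paired fastq files follow a hypothetical sample_5_R1.fastq / sample_5_R2.fastq
# naming convention with the sample name in the file name.
import glob
import os
import subprocess

REFERENCE = "dla_drb1.fa"        # placeholder reference fasta
READ1_PATTERN = "*_R1.fastq"     # placeholder naming convention

subprocess.run(["bwa", "index", REFERENCE], check=True)

bam_files = []
for r1 in sorted(glob.glob(READ1_PATTERN)):
    r2 = r1.replace("_R1.fastq", "_R2.fastq")
    sample = os.path.basename(r1).replace("_R1.fastq", "")   # e.g. "sample_5"
    bam = sample + ".sorted.bam"

    # Attach the read group (ID and SM = sample name) at alignment time;
    # FreeBayes uses the SM tag to keep the samples separate in the final VCF.
    read_group = r"@RG\tID:{0}\tSM:{0}".format(sample)
    bwa = subprocess.Popen(
        ["bwa", "mem", "-R", read_group, REFERENCE, r1, r2],
        stdout=subprocess.PIPE)
    subprocess.run(["samtools", "sort", "-o", bam, "-"],
                   stdin=bwa.stdout, check=True)
    bwa.stdout.close()
    if bwa.wait() != 0:
        raise RuntimeError("bwa mem failed for " + sample)
    subprocess.run(["samtools", "index", bam], check=True)
    bam_files.append(bam)

# Joint call across all BAMs: one VCF with one genotype column per coyote.
with open("all_coyotes.vcf", "w") as vcf:
    subprocess.run(["freebayes", "-f", REFERENCE] + bam_files,
                   stdout=vcf, check=True)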
However, one key question remains: which reference genome would you like to align to and call variants against?
Anton,
First off, thank you so much for taking the time to respond; I have been stuck on this for quite a while. You are correct, I have 2 fastq files (read1 and read2) for each individual. Do you have information about creating this combined BAM file using Galaxy?
About the genome: we are interested in looking at variants in the Canis familiaris MHC class II DLA DRB1 beta chain, exon 2. I went to NCBI to get the nucleotide sequence for this using accession number U47338.1 and created my own fasta file. I have been using BWA to align to this sequence; should I be aligning to the entire genome?
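(In case it is useful to anyone, the same fasta can also be fetched programmatically with Biopython rather than copied from the website by hand; the email address and output file name below are just placeholders.)

from Bio import Entrez

Entrez.email = "you@example.org"   # NCBI asks for a contact address
handle = Entrez.efetch(db="nucleotide", id="U47338.1",
                       rettype="fasta", retmode="text")
with open("U47338.1.fa", "w") as out:
    out.write(handle.read())
handle.close()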
Your suggestions have been incredibly helpful, but my main question is: do I have to assign sample names individually? I have been doing this manually for each sample using Picard Tools. It would be great if there were a tool that could take the name of the original fastq file (which has the sample name in it) and use it to populate the sample name parameter in Picard Tools. I don't believe this functionality exists, unless I am missing something.
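For what it is worth, outside of Galaxy this step can be scripted; here is a minimal sketch of pulling the sample name out of the file name and passing it to Picard's AddOrReplaceReadGroups (the file names and the picard.jar path are assumptions).

import os
import subprocess

def add_read_group(bam_in, picard_jar="picard.jar"):
    # Derive the sample name from the file name, e.g. "sample_5.bam" -> "sample_5",
    # and use it for both the read group ID and the sample (SM) field.
    sample = os.path.basename(bam_in).split(".")[0]
    bam_out = sample + ".rg.bam"
    subprocess.run([
        "java", "-jar", picard_jar, "AddOrReplaceReadGroups",
        "I=" + bam_in, "O=" + bam_out,
        "RGID=" + sample, "RGSM=" + sample,
        "RGLB=" + sample, "RGPL=illumina", "RGPU=unit1",
    ], check=True)
    return bam_out

add_read_group("sample_5.bam")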
Creating a dataset collection from your fastq files would assign the name of each file as its identifier in the collection. I have recently been working on updating the devteam tools to allow them to automatically pull this information out of the collection and assign read groups based on it. I am not done yet, and even after I am it will be some time before the tools are published to, say, usegalaxy.org, but we are working on the problem.
https://github.com/jmchilton/galaxy/commit/4919bf747633a383900895b26c593edbce302361
I can see that taking info from dataset collection metadata would indeed be very useful. But wouldn't it be easier, and possibly more generally useful, to take the info from the individual dataset names (as most of the relevant info, such as sample, read group and library, is usually there)?
I agree with Guy; it seems counter-intuitive that tools that require a sample name or read group do not automatically use the original name of the input file as the default. That way, if the original fastq file was named "sample_5", this value would be populated as the sample name, and any downstream tool that displays this string would show the correct info without the need to enter it manually.
If you create, say, a list with an uploaded "sample_5" dataset, the collection identifier is automatically set to "sample_5". The user will be able to change this at creation time in the future, but this is and will remain the default.