I have posted this question before and did not receive any responses, so let me try to articulate the issue more effectively this time. I have 200 samples (specifically from 200 different coyotes) and I am trying to detect SNPs in the MHC gene. I have created a workflow in Galaxy and would like to use it to automate all of the required steps. At the end I would like all of the data compiled into one VCF file, listed by sample name. The problem is that I cannot find a way to have each sample maintain its unique sample name through the workflow without manually entering it at some point. Any ideas?
I presume you have targeted sequencing data (e.g., 200 fastq datasets) for the 200 individuals. If this is correct, one approach would be to create an unaligned BAM file combining the data for all coyotes, with individuals labelled using read groups (see http://bit.ly/1H1v48z for an explanation of read groups). This can be done within or outside of Galaxy.
Next, you will align the reads within the unaligned BAM using BWA (do you have a coyote genome?). BWA accepts unaligned BAM files as input and will generate an aligned BAM with the read group data preserved. You will then feed this BAM dataset to a variant caller such as FreeBayes, and the output will be a VCF file with all 200 samples called individually.
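If you end up doing this outside of Galaxy, a rough sketch of the same idea is below. The file names, the reference name (dla_drb1.fa) and the fastq naming convention (sample_5_R1.fastq / sample_5_R2.fastq) are placeholders, and instead of going through an unaligned BAM this sketch attaches the read group directly at alignment time with bwa mem -R; FreeBayes is then run jointly over all the per-sample BAMs to produce a single multi-sample VCF.

#!/usr/bin/env python
# Sketch only: assumes bwa, samtools and freebayes are on PATH, and that the
# paired fastq files follow a hypothetical sample_5_R1.fastq / sample_5_R2.fastq
# naming convention with the sample name in the file name.
import glob
import os
import subprocess

REFERENCE = "dla_drb1.fa"        # placeholder reference fasta
READ1_PATTERN = "*_R1.fastq"     # placeholder naming convention

subprocess.run(["bwa", "index", REFERENCE], check=True)

bam_files = []
for r1 in sorted(glob.glob(READ1_PATTERN)):
    r2 = r1.replace("_R1.fastq", "_R2.fastq")
    sample = os.path.basename(r1).replace("_R1.fastq", "")   # e.g. "sample_5"
    bam = sample + ".sorted.bam"

    # Attach the read group (ID and SM = sample name) at alignment time;
    # FreeBayes uses the SM tag to keep the samples separate in the final VCF.
    read_group = r"@RG\tID:{0}\tSM:{0}".format(sample)
    bwa = subprocess.Popen(
        ["bwa", "mem", "-R", read_group, REFERENCE, r1, r2],
        stdout=subprocess.PIPE)
    subprocess.run(["samtools", "sort", "-o", bam, "-"],
                   stdin=bwa.stdout, check=True)
    bwa.stdout.close()
    if bwa.wait() != 0:
        raise RuntimeError("bwa mem failed for " + sample)
    subprocess.run(["samtools", "index", bam], check=True)
    bam_files.append(bam)

# Joint call across all BAMs: one VCF with one genotype column per coyote.
with open("all_coyotes.vcf", "w") as vcf:
    subprocess.run(["freebayes", "-f", REFERENCE] + bam_files,
                   stdout=vcf, check=True)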
However, one key question remains: which reference genome would you like to align to and call variants against?
Anton,
First off, thank you so much for taking the time to respond; I have been stuck on this for quite a while. You are correct, I have 2 fastq files (read1 and read2) for each individual. Do you have information about creating this combined BAM file using Galaxy?
About the genome: we are interested in looking at variants in the Canis familiaris MHC class II DLA DRB1 beta chain, exon 2. I went to NCBI to get the nucleotide sequence for this using accession number U47338.1 and created my own fasta file. I have been using BWA to align to this sequence; should I be aligning to the entire genome?
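(In case it is useful to anyone, the same fasta can also be fetched programmatically with Biopython rather than copied from the website by hand; the email address and output file name below are just placeholders.)

from Bio import Entrez

Entrez.email = "you@example.org"   # NCBI asks for a contact address
handle = Entrez.efetch(db="nucleotide", id="U47338.1",
                       rettype="fasta", retmode="text")
with open("U47338.1.fa", "w") as out:
    out.write(handle.read())
handle.close()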
Your suggestions have been incredibly helpful, but my main question is: do I have to assign sample names individually? I have been doing this manually for each sample using Picard Tools. It would be great if there were a tool that could take the name of the original fastq file (which has the sample name in it) and use it to populate the sample name parameter in Picard Tools. I don't believe this functionality exists, unless I am missing something.
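For what it is worth, outside of Galaxy this step can be scripted; here is a minimal sketch of pulling the sample name out of the file name and passing it to Picard's AddOrReplaceReadGroups (the file names and the picard.jar path are assumptions).

import os
import subprocess

def add_read_group(bam_in, picard_jar="picard.jar"):
    # Derive the sample name from the file name, e.g. "sample_5.bam" -> "sample_5",
    # and use it for both the read group ID and the sample (SM) field.
    sample = os.path.basename(bam_in).split(".")[0]
    bam_out = sample + ".rg.bam"
    subprocess.run([
        "java", "-jar", picard_jar, "AddOrReplaceReadGroups",
        "I=" + bam_in, "O=" + bam_out,
        "RGID=" + sample, "RGSM=" + sample,
        "RGLB=" + sample, "RGPL=illumina", "RGPU=unit1",
    ], check=True)
    return bam_out

add_read_group("sample_5.bam")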
Creating a dataset collection from your fastq files would assign the name of each file as its identifier in the collection. I have recently been working on updating the devteam tools to allow them to automatically pull this information out of the collection and assign read groups based on it. I am not done yet, and even after I am it will be some time before the tools are published to, say, usegalaxy.org, but we are working on the problem.
https://github.com/jmchilton/galaxy/commit/4919bf747633a383900895b26c593edbce302361
I can see that taking info from dataset collection metadata would indeed be very useful. But wouldn't it be easier, and possibly more generally useful, to take the info from the individual dataset names (as most of the relevant info, such as sample, read group and library, is usually there)?
I agree with Guy; it seems counter-intuitive that tools that require a sample name or read group do not automatically use the original name of the input file as the default. That way, if the original fastq file was named "sample_5", this value would be populated as the sample name, and any downstream tool that displays this string would show the correct info without the need to enter it manually.
If you create, say, a list with an uploaded "sample_5" dataset, the collection identifier is automatically set to "sample_5". The user will be able to change this at creation time in the future, but this is and will remain the default.