I am working on a project identifying SNPs/SNVs in a few sequences and was wondering if there is a good workflow for conducting such an analysis with Galaxy. I have VCF and BAM files (already indexed) available for use and would ultimately like to align them to a reference sequence and see what SNPs I can find. One small caveat: these files include whole chromosomes' worth of data.
While I can find SNPs manually in IGV, it is tedious, and I am sure there is a way to solve this issue using bioinformatics. Ideally I would like to use VCF files, because they include genotypes.
Any suggestions? Is it even possible?
Thorough, step-by-step answers would be appreciated (I'm a rookie, can you tell?).
Thanks for your time in advance!
Would be happy to help, but I need a better understanding of what you mean by 'finding' SNPs. Also, what are your .vcf file and your .bam file derived from? Your question is written very clearly, and with some of this extra info I am fairly sure you will get help.
just now by
kdevitofranceschi • 10
Of course! Thank you for your response.
The VCF/BAM files were derived from Illumina HiSeq 2000 reads that were then mapped to the human genome (GRCh37).
As for what I mean by finding SNPs: these libraries have been created, and I would like to find any SNPs present in the sample in comparison with the reference genome. So, for example, if the reference nucleotide at position x is A, I would want to know if it's anything other than A in my sample. Does that make sense?
Thanks for your help. I greatly appreciate it!
So the .vcf is a file generated from the .bam file? If so, they basically represent different formats of the same data?
If you want to know about all the sites where mapped reads were different from the reference you mapped them to, this is exactly what the VCF file is: one line for each potentially variable site once you go past the header (press the eye icon in the dataset, scroll down past the header, and then look across the column headings). Each sample has its own column, and there can be more than one sample in a file.
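In case it helps to see that layout outside Galaxy, here is a minimal Python sketch of reading a VCF by hand. It assumes the standard tab-separated layout (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample) and a GT entry in FORMAT. The two-sample snippet and the function names are made up for illustration, not taken from your data.

```python
import io

# Hypothetical two-sample VCF snippet, invented purely for illustration.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB\n"
    "1\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\n"
    "1\t2000\t.\tC\tT\t12\t.\t.\tGT\t1/1\t0/1\n"
)


def read_vcf(handle):
    """Yield (chrom, pos, ref, alt, genotypes) for each variant line."""
    samples = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information header lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
        genotypes = {
            name: column.split(":")[gt_index]
            for name, column in zip(samples, fields[9:])
        }
        yield chrom, int(pos), ref, alt, genotypes


def has_alt_allele(genotype):
    """True if any allele in a genotype such as '0/1' is non-reference."""
    alleles = genotype.replace("|", "/").split("/")
    return any(allele not in ("0", ".") for allele in alleles)


def non_reference_sites(handle):
    """List (chrom, pos, ref, alt, samples_carrying_an_alt_allele)."""
    sites = []
    for chrom, pos, ref, alt, genotypes in read_vcf(handle):
        carriers = [s for s, gt in genotypes.items() if has_alt_allele(gt)]
        if carriers:
            sites.append((chrom, pos, ref, alt, carriers))
    return sites


sites = non_reference_sites(io.StringIO(EXAMPLE_VCF))
```

For a real file you would pass `open("your_file.vcf")` instead of the `io.StringIO` wrapper. For whole-chromosome files, a dedicated VCF tool will be more practical than hand-rolled parsing.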
Does reading about the .vcf file format help in the following link?
But be cautious: the list in the VCF contains all potential variants and may well include many low-confidence sites that are errors (from sequencing and mapping). Information about the quality of each variant is also in the .vcf. You then need to use a program to filter out low-confidence calls to give you a list of probable SNPs or indels. An example of one program that can do this is the GATK UnifiedGenotyper. Does any of this help? Guy
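To make the filtering idea concrete, here is a rough Python sketch that keeps only records whose QUAL value (the sixth column) meets a threshold. This is not what GATK does internally. The cutoff of 30 and the example lines are arbitrary assumptions for illustration, and a sensible threshold depends on your caller and data.

```python
def filter_by_qual(vcf_lines, min_qual=30.0):
    """Keep header lines plus variant lines with QUAL >= min_qual.

    min_qual=30.0 is an arbitrary illustrative cutoff, not a recommendation.
    """
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # always keep header lines
            continue
        qual = line.split("\t")[5]  # QUAL is the 6th tab-separated column
        if qual != "." and float(qual) >= min_qual:
            kept.append(line)  # drop records with missing or low QUAL
    return kept


# Hypothetical records, invented for illustration.
example_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t50\tPASS\t.",
    "1\t2000\t.\tC\tT\t12\t.\t.",
    "1\t3000\t.\tG\tA\t.\t.\t.",
]
confident = filter_by_qual(example_lines)
```

With these made-up records only the site at position 1000 survives, along with the two header lines.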