I would like to know how to generate a dbSNP "Reference-Ordered Data" (ROD) file from dbSNP data, and whether it is
possible to generate one per human chromosome.
Appreciate your help.
Sincerely,
bio_vitus
Hello,
ROD data is reference annotation that is anchored by genomic position. This makes it easy to compare two datasets by position.
If you use the built-in human hg_g1k_v37 genome (1000 Genomes), then dbSNP is already indexed (sourced from the GATK resource bundle). If you want to use hg38, hg19, or any other build, then the idea is to locate a VCF dataset based on exactly the same reference genome (dbSNP or any other annotation that you want to link in). GATK itself, NCBI, UCSC, and others can be good data sources. A web search should help narrow down the choices.
This prior Biostars post addresses essentially the same question, with the bonus of covering the importance of confirming format (including the ordering of chromosomes in input data). Hopefully it will help: https://www.biostars.org/p/8212/
This might also help when deciding which genome to use (hg_g1k_v37, or another build as a Custom Genome). It covers the details of that genome, plus format and chromosome order when using a Custom Genome or linking in other annotation data: Fasta Format, Custom Genomes, and GATK Chromosome ordering
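As a rough illustration of the chromosome naming and ordering check mentioned above, here is a minimal Python sketch. It is not a Galaxy or GATK tool, and the function names (fasta_contig_order, vcf_contig_order, check_contigs) are made up for this example. It assumes plain-text (uncompressed) FASTA and VCF lines and only compares contig names and their order, not positions within a contig.

```python
def fasta_contig_order(fasta_lines):
    """Return contig names in the order their '>' headers appear."""
    return [line[1:].split()[0] for line in fasta_lines if line.startswith(">")]


def vcf_contig_order(vcf_lines):
    """Return the CHROM values of the records, collapsing consecutive repeats."""
    blocks = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom = line.split("\t", 1)[0]
        if not blocks or blocks[-1] != chrom:
            blocks.append(chrom)
    return blocks


def check_contigs(fasta_lines, vcf_lines):
    """Return a list of problems; an empty list means names and order agree."""
    rank = {name: i for i, name in enumerate(fasta_contig_order(fasta_lines))}
    vcf_order = vcf_contig_order(vcf_lines)
    problems = []
    unknown = sorted({c for c in vcf_order if c not in rank})
    if unknown:
        problems.append("not in reference: " + ", ".join(unknown))
    ranks = [rank[c] for c in vcf_order if c in rank]
    if any(a >= b for a, b in zip(ranks, ranks[1:])):
        problems.append("contigs out of reference order")
    return problems


# Tiny in-memory example (made-up records, for illustration only).
reference = [">1 example", "ACGT", ">2", "ACGT", ">X", "ACGT"]
good_vcf = ["##fileformat=VCFv4.2",
            "1\t100\trs1\tA\tG\t.\t.\t.",
            "2\t50\trs2\tC\tT\t.\t.\t."]
bad_vcf = ["2\t50\trs2\tC\tT\t.\t.\t.",
           "1\t100\trs1\tA\tG\t.\t.\t.",
           "chr3\t10\trs3\tG\tA\t.\t.\t."]
print(check_contigs(reference, good_vcf))
print(check_contigs(reference, bad_vcf))
```

For real files, the same functions can be given open file handles, for example open("genome.fa") and open("dbsnp.vcf"); those file names are placeholders.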
If you just want to work on one chromosome, you can use the full ROD dataset and supply only a VCF containing the variant calls for the chromosome of interest. Alternatively, you can filter both datasets down to that chromosome with the VCF Filter tool.
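Outside of Galaxy, the per-chromosome filtering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what the VCF Filter tool does internally: the function name filter_vcf_by_chrom is made up, the input is assumed to be plain-text VCF lines, and the chromosome name must match the file's own naming (hg_g1k_v37 uses "20", while UCSC-style builds such as hg19 use "chr20").

```python
def filter_vcf_by_chrom(vcf_lines, chrom):
    """Yield all header lines plus the records whose CHROM column equals `chrom`.

    Header lines (starting with '#') are passed through unchanged, so any
    ##contig entries for other chromosomes remain in the output header.
    """
    for line in vcf_lines:
        if line.startswith("#") or line.split("\t", 1)[0] == chrom:
            yield line


# Tiny in-memory example (made-up records, for illustration only).
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t.\t.\t.",
    "20\t200\trs2\tC\tT\t.\t.\t.",
    "20\t300\trs3\tG\tA\t.\t.\t.",
]
chr20_only = list(filter_vcf_by_chrom(example_vcf, "20"))
for line in chr20_only:
    print(line)
```

The same generator can be applied to both the dbSNP VCF and the variant-call VCF so that the two files cover the same chromosome.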
Thanks! Jen, Galaxy team