How To Extract Geneid From Pileup File?

5.3 years ago, Yan He • 240 wrote:

Dear galaxy-users,

I am working on a project to identify and genotype SNPs in targeted genes, and I have done some analysis using Galaxy. First, I mapped the reads to the genome with Bowtie. Second, I identified SNPs using MPileup in SAMtools. The resulting pileup file reports each SNP by chromosome and position.

I would like to focus on the SNPs within genes. How can I extract the SNP information (SNP position, coverage) for each gene? Is there a tool in Galaxy that does this?

Any help is highly appreciated!

Best wishes,
Yan

alignment bowtie mpileup samtools bam • 1.3k views

modified 5.3 years ago by Jennifer Hillman Jackson ♦ 25k • written 5.3 years ago by Yan He • 240

5.3 years ago, Jennifer Hillman Jackson ♦ 25k (United States) wrote:

Hello,

*First option* is the tool "SnpEff Variant effect and annotation". This would require setting up a cloud instance and adding the appropriate annotation to the tool for the genome you are working with. See the Tool Shed for more about SnpEff, or use the Main/Test server if you want to try it out. Note that it is not set up for very many genomes there, and quotas on Test are small, as it is not intended for intensive use.
http://wiki.galaxyproject.org/Cloud

*The second option* is a more step-by-step method, something like:

1 - Start with a pileup file (not VCF, so use 'Generate pileup', or use 'MPileup' without 'Genotype Likelihood Computation:').
2 - Use 'Filter Pileup' and for 'Convert coordinates to intervals?:' choose "yes".
3 - Now that the data is in interval format, it can be compared with any other interval (bed, etc.) dataset that is mapped to the same genome, to determine overlap using the tools in the 'Operate on Genomic Intervals' tool group.

Obtain gene (actually transcript) annotation bed files ('bed' is a stricter form of 'interval' format) from sources under "Get Data". Good choices are UCSC and Biomart for many genomes, in particular because you can select reference bed files that contain specific regions of transcripts: UTRs, exons, introns, user-specified regions upstream or downstream, etc. Other sources may be appropriate depending on your genome and needs. As long as the reference annotation you are using is mapped to exactly the same genome build, this will work. Once you have a process, save it as a workflow for future use.

*Another great (NEW!) option* includes some tools that are still in beta status on the Test server. You can run them there on very small datasets to see if you like them, then decide whether moving to a cloud instance and setting them up there is something you want to do. Called "Naive Variant Detector" and "Variant Annotator", these run on VCF files and produce statistics somewhat similar to "Filter Pileup", but with more detail and a different underlying algorithm. The result here is not in interval format. It is VCF, but it could be converted (use tools in Text Manipulation to create a start/stop) or passed to SnpEff as is.

You had another earlier question about this same analysis. I will include some other advice in that reply, next.

Best,
Jen
Galaxy team

--
Jennifer Hillman-Jackson
http://galaxyproject.org
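For readers who want to see what the step-by-step option does outside Galaxy, here is a minimal Python sketch of the same idea: turn each pileup row into a one-base interval, then report the rows that overlap a gene in a BED file. It is not the Galaxy tools' implementation. The function names are made up for illustration, and it assumes a tab-separated samtools pileup (chromosome, 1-based position, reference base, depth, ...) and a BED file with the gene name in column 4. It also does no variant filtering, so it should be fed rows that have already been reduced to SNP positions.

```python
import io
from collections import defaultdict


def read_bed(handle):
    """Return {chrom: [(start, end, name), ...]} from BED lines (0-based, half-open)."""
    genes = defaultdict(list)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines, comments, and UCSC track/browser header lines.
        if len(fields) < 3 or fields[0].startswith(("#", "track", "browser")):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        genes[chrom].append((start, end, name))
    return genes


def pileup_to_intervals(handle):
    """Yield (chrom, start, end, ref_base, coverage) per pileup row.

    Pileup positions are 1-based; the interval is 0-based, half-open,
    so position p becomes [p - 1, p).
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        chrom, pos, ref, depth = fields[0], int(fields[1]), fields[2], int(fields[3])
        yield chrom, pos - 1, pos, ref, depth


def positions_in_genes(pileup_handle, bed_handle):
    """Return (gene, chrom, 1-based position, ref_base, coverage) for rows inside a gene."""
    genes = read_bed(bed_handle)
    hits = []
    for chrom, start, end, ref, depth in pileup_to_intervals(pileup_handle):
        # Simple linear scan over the genes on this chromosome (fine for small data).
        for g_start, g_end, name in genes.get(chrom, []):
            if start < g_end and end > g_start:
                hits.append((name, chrom, end, ref, depth))
    return hits


if __name__ == "__main__":
    # Tiny made-up inputs, only to show the expected file layout.
    pileup = io.StringIO(
        "chr1\t150\tA\t12\t............\tIIIIIIIIIIII\n"
        "chr1\t900\tG\t7\t.......\tIIIIIII\n"
    )
    bed = io.StringIO("chr1\t100\t200\tgeneA\n")
    for hit in positions_in_genes(pileup, bed):
        print(*hit, sep="\t")
```

The linear scan over genes keeps the sketch short. For whole-genome annotation, sorting the intervals or using a dedicated interval tool would be the more practical choice.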

