Question: Naive Variant Caller
0
gravatar for Suzanne Gomes
3.5 years ago by
Suzanne Gomes120
Canada
Suzanne Gomes120 wrote:

I am using RNA-seq data for two different species and their hybrids. I want to find all positions with fixed differences between the two parent species, so that I can then assess any differences in expression of the two alleles in the hybrids (allelic imbalance), which can be used to infer regulatory evolution between the two parental species.

I'm currently trying to use the Naïve Variant Caller to do this, and I have a few questions about how it works.

1. When I run it, there is nothing in the filter field indicating whether the reads pass the quality filter (even though I set a minimum quality of 20). There’s just a period in the filter field for every position. Is it actually filtering my data or not?

2. I set a minimum number of reads to consider REF/ALT to 10. However it seems to output any position where there is at least one base matching the reference, but doesn’t report alternate alleles at all if there aren’t at least 10. What I really want is for it to only report positions with 10+ supporting bases overall, regardless of how many do or don’t match the reference. 

3. My samples are derived from Drosophila, so they’re diploid, but each sample contains tissues from 30-40 individuals. So I set the ploidy to 2, but theoretically there could be more than two bases present at a given position. What should I do to accommodate this? Set the ploidy to 4, since there are only 4 possible bases?

4. My samples have strand information, so I set it to report counts by strand. In every case, I get the same base reported for the + and - strands. Shouldn’t they be complementary, not identical? I get the same thing when running through mpileup. 

rna-seq snp • 2.0k views
ADD COMMENTlink modified 3.5 years ago by Daniel Blankenberg ♦♦ 1.7k • written 3.5 years ago by Suzanne Gomes120
2
gravatar for Daniel Blankenberg
3.5 years ago by
Daniel Blankenberg ♦♦ 1.7k
United States
Daniel Blankenberg ♦♦ 1.7k wrote:

Hi Suzanne,

Hopefully I can clarify somethings.

1.) Reads that do not pass the configured filter settings are not included in base counts nor used for the determination of alleles or frequencies.

2.) This is the minimum number of reads needed to support a particular base (or indel) at a position for genotype calls and allele frequency.

3.) This really depends on what you are trying to determine. Do you have separate readgroups for each individual or were the individuals sequenced as a single non-barcoded sample? The ploidy affects the genotype call, but you will get counts for each base, regardless.

4.) All bases are reported in reference to the + strand as the VCF format, itself, is positionally based this way. This also simplifies comparisons in e.g. bias between strands (no need to do an extra step to determine complement). Would it be useful to optionally allow the reads that mapped to the reverse complemented sequence be reported as the reverse complement of the sequence? I worry that this may unnecessarily complicate things.

 

Let us know if we can provide additional information.

ADD COMMENTlink written 3.5 years ago by Daniel Blankenberg ♦♦ 1.7k

1. Ok, thanks. That clears it up.

2. I realized after running NVC with the minimum set to both 10 and 1, that every base (that passes filtering) gets reported in the final column, but only those with the minimum number of supporting reads get reported in the 'ALT' column. So that makes more sense now!

3. The 30 individuals were all pooled together in the same RNA extraction, so they all have the same barcode. So I'm still not quite clear on what would happen if, for example, I had the min # supporting reads set to ten, and there were at least 10 reads with C, A, and T at a position with reference base = G. If ploidy is set to 2, would it still report under the ALT column all three alternative bases?

4. Ok.. so they're all reported as whatever the + strand is. So, if I understand correctly, when I have a column with this: 0/0:::+G=9,-G=15, it means that there are 9 bases that mapped to G on the + strand, and 15 that mapped to the complementary C on the - strand (but those C's are simply reported as G's). If that's the case, then yes, I agree it's much easier to simply leave it that way. I was just worried that it was telling me I had a G on both strands!

Thanks for your answer! 

ADD REPLYlink written 3.5 years ago by Suzanne Gomes120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 64 users visited in the last hour