Question: Pileup consensus not called correctly
0
gravatar for sprosser
4.0 years ago by
sprosser10
Canada
sprosser10 wrote:

Hi,

I want to extract a consensus sequence from an aligned BAM file, which was generated by aligning Ion Torrent reads to a reference.  The reference sequence was a close, but not exact, match to my sample.  Working solely on the Galaxy public web interface, I used pileup with 10 column format with plans to subsequently cut out column 4 (alignment consensus) for further analysis.  However, when I inspect the pileup output, I noticed that the alignment consensus is identical to the reference sequence, even though the aligned reads have a different base at certain positions.  I'm new to NGS analysis so hopefully this is an obvious easy fix to someone more familiar than me.  Below is an example of my pileup output.  Note positions 38, 40, and 43, which should be A, T, and A right?  Not G, A, and G? Also, certain positions (i.e. 41) seem to have duplicate lines.  Is that normal? Thanks!

 

Milocera_horaria_REF    37    A    A    212    0    60    8019    ...............................
Milocera_horaria_REF    38    G    G    48    0    60    8019    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Milocera_horaria_REF    39    T    T    39    0    60    8019    ...............................
Milocera_horaria_REF
Milocera_horaria_REF    40    A    A    39    0    60    8019    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
Milocera_horaria_REF    41    G    G    184    0    60    8019    ...............................
Milocera_horaria_REF    41    G    G    184    0    60    8019    ...............................
Milocera_horaria_REF
Milocera_horaria_REF    42    G    G    114    0    60    8019    ...............................
Milocera_horaria_REF
Milocera_horaria_REF    43    G    G    102    0    60    8019    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Milocera_horaria_REF
Milocera_horaria_REF    44    A    A    255    0    60    8019    ...............................
Milocera_horaria_REF
Milocera_horaria_REF    45    C    C    255    0    60    8019    ...............................

 

galaxy samtools • 1.6k views
ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by sprosser10
0
gravatar for Jennifer Hillman Jackson
4.0 years ago by
United States
Jennifer Hillman Jackson25k wrote:

Hello,

More about this format is here:
http://samtools.sourceforge.net/cns0.shtml

I am wondering if there is a one-off base counting issue when you are inspecting the data. Please try viewing in a browser (Trackster would be one choice, see Visualizations). Assign the custom reference genome as a custom build 'database' to the dataset first. Create it under 'User -> Custom Builds'.

Best, Jen, Galaxy team 

ADD COMMENTlink written 4.0 years ago by Jennifer Hillman Jackson25k

Hi Jen,

Thanks for the prompt response!  I was able to visualize the mapped reads in Trackster but I'm stuck because I'm not sure what you mean by a "one-off base counting issue".  Do you mean my reference or consensus sequences are shifted by a single base or something like that?   I'm not entirely sure what I should be looking for, but my reads cover then entire reference sequence is that's what you wanted to know.  One possible point of interest is that my reference sequence is 1 nt shorter than it should be according to the size stated in Trackster, but I'm not sure if that's significant.

Also, I didn't filter my BAM file prior to pileup, could that be causing issues?  For example, could un-mapped reads be getting through into the final pileup dataset and throwing off the consensus calling? 

Thanks,

Sean

ADD REPLYlink written 4.0 years ago by sprosser10

Hi again Jen,

I've been doing a lot of digging and could the fact that the SNP quality at all positions is 0 cause the consensus algorithm to not call the SNP?  If so, that would explain why the reference sequence is being used as the consensus.  In that case, any idea why the SNP quality would be 0?  My reads are probably full of indels (sequencing artefacts) which, from what I can tell, might affect the quality scoring of nearby SNPs, but the SNP's themselves still seem to be of sufficient quality to be included in the consensus.  I know from Sanger data that the SNP's are real.

Thanks,

Sean

ADD REPLYlink written 4.0 years ago by sprosser10

Hello,

I would try three things:

1. Double check that the data is in fastqsanger format (and that is assigned as the datatype) and review the quality score stats. http://wiki.galaxyproject.org/Support#FASTQ_Datatype_QA Run the tool FastQC and adjust quality score scaling if needed with the tool Fastq Groomer. Consider some of the QC/trimming tools also, as needed. If you manipulate the data content, run FastQC again to see the final stats. As a general guideline, be stricter with QC for variant analysis than with some other pipelines where one wants to maximize/retain sequence length (example: RNA-seq - if the data is good enough to map effectively, that is usually enough).

2. Run this tool with the option 'Whether or not to print the mapping quality as the last column:' set to print out the scores.

3. Make certain the quality scores for true indels (not sequencing errors) are at or above this form option threshold (default is 40, but you can adjust this): 'Phred probability of an indel in sequencing/prep'.

For the reference genome size .. I am not quite sure why Trackster would be reporting one base less for a chromosome size. This is on the public Main Galaxy instance at http://usegalaxy.org ? Make sure you are accounting for the 0-based start position often found in data coordinates. Meaning, a base range that is stated like "chr1  99  100" will be labeled as base "99" in the data, but actually base "100" when visualizing/counting up from the start. Not all data or datatypes have a 0-based start, but you can double check this either from the data source, the tool documentation (including 3rd party - links are on tool forms), or by examining the specification of a particular datatype. Common descriptions are in our wiki http://wiki.galaxyproject.org/Learn/Datatypes, UCSC has many in the FAQ http://genome.ucsc.edu, and a google brings up many other options.

Hopefully some part of this helps to sort out the problem, Jen, Galaxy team

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Jennifer Hillman Jackson25k

Hi Jen,

I am working solely on the public Main Galaxy instance.  I confirmed my data is in fastsanger format, even after manipulation, and indel quality scores are above the phred probability threshold (though I have no true indels in my data).  Basically I'm trying to align Ion Torrent reads (reads are about 150-250 bp) to a relatively short reference (658 bp of COI) so that I can then extract the consensus sequence and BLAST it for identification purposes.  Maybe there's a better way to go about that but what I've done is: 

1. Import raw reads and 658 bp reference sequence (ref seq in FASTA format).

2. Groom raw reads (I'm not sure if this is necessary since it's already in fastsanger format but I did it to be safe).

3. Trim ends of reads (to remove primers/adapters).

4. Filter out reads below 100 bp.

5. Convert filtered reads from FASTQ to FASTA.

5. Map reads to the reference seq using LASTZ (reference seq was also added as custom database and associated mapped LASTZ output.

6. Convert mapped reads from SAM to BAM. At this point I can visualize the mapped reads and confirm that they span the entire 658 bp reference and contain SNP's relative to the reference.  Now I want to extract the consensus sequence of the mapped reads, so I...

7. Generate consensus pileup of BAM data and reference seq.

8. Cut column 4 (which is the consensus sequence), paste into Excel to transpose into horizontal format, BLAST consensus sequence to identify sequence.

Unfortunately I can't get past step #7 because the consensus won't call the SNP's, but rather uses the reference.  The SNP quality (column 6) is 0 for all positions, so maybe that's the issue casing pileup to not call the SNP's?  I'm not sure why it's 0 but I'll keep digging and post if I find a solution.  If you see any major flaws in my workflow (or know of a better way to accomplish my task) I'm all ears!  Thanks again for your help.

-Sean

ADD REPLYlink written 4.0 years ago by sprosser10

Hi Jen,

I think the SNP's are not being called because the quality scores are all 0.  I suspect they're all 0 because of the default use of BAQ in pileup.  Mpileup appears to have the option to disable BAQ so I'll try that.  I don't recall seeing mpileup on the Mail Galaxy instance but maybe I just missed it.  If not, I'll try another Galaxy server or run it locally.  I'll post if that solution works for me and close this thread.

-Sean

ADD REPLYlink written 4.0 years ago by sprosser10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 169 users visited in the last hour