Question: Bowtie2/FreeBayes/mpileup variant detection on NGS of PCR amplicons around Cas9/CRISPR indels
btarkowski (Poland) wrote, 3.1 years ago:

Hey there,

I have a MiSeq experiment using 24 indices, where in each index I sequenced 3 PCR-amplified loci (275 bp) from triploid clones of a mouse cell line that was subjected to Cas9/CRISPR digestion, i.e. should contain indels. Moreover, the clones were not picked very carefully: sometimes there might actually be 2 or 3 different clones present under one index. Anyhow, in each index, for each of the 3 loci, there should be up to 9 different sequences.

The run was unpaired, 150 bp; the indels are 10-20 nt from the 3' end of the reads. (The run was actually paired 2x150 bp, but the reverse read did not work out, which left me with unpaired sequencing from the forward primer only.)

The first, smaller problem I encountered was with the mapping: it is hard to tweak parameters that will map all the reads with various deletion lengths at the same time. There are occasional single-base substitutions, short insertions (up to 3 bp), and deletions usually 1-10 bp long, but occasionally also 15-30 or even 60 or 105 bp long. I managed to set up Bowtie2 to detect substitutions, insertions, and deletions up to 30 bp, but detecting the longer ones seems impossible. They either show up in the mapping as numerous substitutions and shorter indels toward the end of the read, or are not mapped at all (dumped to the unaligned-reads file). This might be because there is very little distance between the indel and the end of the read (as I mentioned above, 10-20 bp). The input parameters for Bowtie2 I use are as follows (sorry, I can't find where the resulting command line is):

Input parameter: Value
- Is this single or paired library: single
- FASTQ file: 48: FASTQ Groomer on data 24
- Write unaligned reads (in fastq format) to separate file(s): True
- Will you select a reference genome from your history or use a built-in index?: indexed
- Select reference genome: mm10
- Specify the read group for this file?: no
- Select analysis mode: full
- Do you want to tweak input options?: no
- Do you want to tweak alignment options?: no
- Do you want to tweak scoring options?: yes
- Set the match bonus: 2
- Set the maximum (`MX`) and minimum (`MN`) mismatch penalties, both integers: 10,8
- Penalty for positions where the read, reference, or both contain an ambiguous character such as `N`: 10
- Set the read gap opening penalty: 10
- Set the read gap extension penalty: 1
- Set the reference gap opening penalty: 10
- Set the reference gap extension penalty: 3
- Do you want to use -a or -k options: no
- Do you want to tweak effort options?: no
- Do you want to tweak SAM/BAM options?: no
- Do you want to tweak other options?: no
- Job Resource Parameters: no
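For reference, the Galaxy scoring settings above should correspond to roughly this bowtie2 command line (a sketch only; file names and the index path are placeholders, and I have not verified this is exactly what Galaxy runs):

```shell
# Approximate command-line equivalent of the Galaxy settings above.
# --ma match bonus, --mp MX,MN mismatch penalties, --np ambiguous-base penalty,
# --rdg/--rfg read/reference gap open,extend penalties, --un unaligned reads.
bowtie2 -x /path/to/mm10/index -U reads.fastq -S mapped.sam \
    --un unaligned.fastq \
    --ma 2 --mp 10,8 --np 10 \
    --rdg 10,1 --rfg 10,3
```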
 

I was advised to use BWA-MEM, but it basically clips off the ends of the reads where the indels start: the parts of the reads after the indels are not mapped, so it is impossible to infer anything about those indels.

Further, I am trying to detect indels in those mappings with FreeBayes. I can see in a genome viewer that there are 3-6 indels at this site per clone, but FreeBayes detects only the very short ones; it basically does not detect deletions larger than ~10 bp.

The command that I run is as follows:

"freebayes --bam localbam_0.bam --fasta-reference /galaxy/data/mm10/sam_index/mm10.fa --vcf /galaxy-repl/main/files/012/883/dataset_12883522.dat --region chr4:148446582..148559685 --theta 0.002 --ploidy 3 -K -m 1 -q 0 -R 0 -Y 0 -e 1000 -F 0.2

I ended up using samtools mpileup to report all mapped indels and SNPs, and writing a Python script to extract the frequencies of those indels from the resulting text files, but having FreeBayes do this for me would be much handier.
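For what it's worth, the kind of extraction script I mean looks roughly like this (a minimal sketch, not my actual script; the pileup line is made up). It counts indel strings such as `+3ACG` or `-2AT` in the read-bases column of a `samtools mpileup` line:

```python
import re
from collections import Counter

# Regex for the start of a pileup indel: a sign followed by the indel length.
INDEL_RE = re.compile(r'[+-](\d+)')

def indel_counts(bases):
    """Count indel strings (e.g. '+3ACG', '-2AT') in one pileup bases field."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == '^':          # read start: the next char is a mapping quality
            i += 2
        elif c in '+-':       # indel: sign, length, then that many bases
            m = INDEL_RE.match(bases, i)
            end = m.end() + int(m.group(1))
            counts[bases[i:end].upper()] += 1
            i = end
        else:                 # match/mismatch, read end '$', or '*' placeholder
            i += 1
    return counts

# Fifth tab-separated field of an mpileup line is the read bases (made-up data).
line = "chr4\t148446600\tA\t8\t..,,+3ACG.-2at,^].$\tIIIIIIII"
print(indel_counts(line.split('\t')[4]))  # one +3ACG insertion, one -2AT deletion
```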

I would appreciate any suggestions on how I could improve the mapping quality and the FreeBayes indel/SNP frequency detection in my data.

Waiting for your reply,

BartekT

 

modified 3.1 years ago by Guy Reeves • written 3.1 years ago by btarkowski

Have you tried using a spliced aligner like HISAT or TopHat, since you have such long reads?

written 3.1 years ago by Bjoern Gruening

Dear Bjoern,

thanks for your input, and sorry for the delay in my reply. I tried TopHat: with standard settings it maps approximately the same fraction of reads (~30%), but maybe one should tweak the run parameters. So far I unfortunately have had no time to do this, but I'll keep your suggestion in mind for the future.

Thanks again,

BartekT

written 3.0 years ago by btarkowski
Answer by Guy Reeves (Germany), written 3.1 years ago:

Hi 

I suspect that you will need a two-step approach; no single program or setting is going to capture both indels <30 bp and those >50 bp.

So you could:

1. Look for the bigger ones first, maybe using pindel (https://testtoolshed.g2.bx.psu.edu/repository?repository_id=6e858db1dc5f6b28&changeset_revision=b5ae2adce876) and the standard reference.

2. Then try to find the smaller ones in the manner you are doing now.

If you wanted to be clever, you could generate a new reference from the pindel step (which includes the indels) and use it in the second step (rather than the standard reference).
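For anyone trying step 1, a pindel run on the command line looks roughly like this (a sketch based on the pindel documentation, not tested on this dataset; paths, the insert size, and the sample name are placeholders):

```shell
# pindel reads a config file with one line per BAM: "bam_path insert_size sample".
echo "localbam_0.bam 300 clone1" > pindel_config.txt

# -f reference FASTA, -i config file, -c chromosome, -o output prefix.
pindel -f mm10.fa -i pindel_config.txt -c chr4 -o pindel_out

# Convert one of the pindel output files (e.g. deletions, suffix _D) to VCF.
pindel2vcf -p pindel_out_D -r mm10.fa -R mm10 -d 20140101 -v pindel_out.vcf
```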

I should say I have no practical experience of doing this, but I did discuss your question with somebody who has quite a bit of experience. So think carefully before you spend time on this advice.

Thanks, Guy

written 3.1 years ago by Guy Reeves

Dear Guy,

thanks for your suggestion, and sorry for replying with such a delay. I'll try to apply these to the next dataset that I analyse.

All the best,

BartekT

written 3.0 years ago by btarkowski