I have a MiSeq experiment using 24 indices; under each index I sequenced 3 PCR-amplified loci (275 bp) from triploid clones of a mouse cell line that had been subjected to a Cas9/CRISPR digest, i.e. they should contain indels. Moreover, the clones were not picked very carefully - sometimes there may actually be 2 or 3 different clones present under one index. In any case, for each of the 3 loci in each index there should be up to 9 different sequences.
The run was unpaired, 150 bp, and the indels are located 10-20 nt from the 3' end of the reads. (The run was actually paired-end 2x150 bp, but the reverse reads did not work out, which left me with unpaired sequencing from the forward primer only.)
The first, smaller problem I encountered was with the mapping - it is hard to tweak the parameters so that reads with various deletion lengths all map at the same time. There are occasional single-base substitutions, short insertions (up to 3 bp), and deletions that are usually 1-10 bp long, but occasionally also 15-30 bp or even 60 or 105 bp. I managed to set up Bowtie2 to detect substitutions, insertions, and deletions up to 30 bp, but detecting the longer ones seems impossible. They either show up in the mapping as numerous substitutions and shorter indels toward the end of the read, or are not mapped at all (dumped in the unaligned-reads file). This may be because there is very little read left after the indel (as I mentioned above, only 10-20 bp). The input parameters I use for Bowtie2 are as follows (sorry, I can't find where the resulting command line is):
| Input Parameter | Value | Note for rerun |
|---|---|---|
| Is this single or paired library | single | |
| FASTQ file | 48: FASTQ Groomer on data 24 | |
| Write unaligned reads (in fastq format) to separate file(s) | True | |
| Will you select a reference genome from your history or use a built-in index? | indexed | |
| Select reference genome | mm10 | |
| Specify the read group for this file? | no | |
| Select analysis mode | full | |
| Do you want to tweak input options? | no | |
| Do you want to tweak alignment options? | no | |
| Do you want to tweak scoring options? | yes | |
| Set the match bonus | 2 | |
| Set the maximum (`MX`) and minimum (`MN`) mismatch penalties, both integers | 10,8 | |
| Sets penalty for positions where the read, reference, or both, contain an ambiguous character such as `N` | 10 | |
| Set the read gap opening penalty | 10 | |
| Set the read gap extension penalty | 1 | |
| Set the reference gap opening penalty | 10 | |
| Set the reference gap extension penalty | 3 | |
| Do you want to use -a or -k options | no | |
| Do you want to tweak effort options? | no | |
| Do you want to tweak SAM/BAM Options? | no | |
| Do you want to tweak Other Options? | no | |
| Job Resource Parameters | no | |
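For reference, the Galaxy settings above should correspond roughly to the following Bowtie2 invocation (file names are placeholders, and the Galaxy wrapper may add further defaults that I cannot see; note also that `--ma` is only used in `--local` mode, so in end-to-end mode the match bonus would be ignored):

```shell
# Approximate command-line equivalent of the Galaxy settings above.
# reads.fastq / mapped.sam / unaligned.fastq are placeholder names.
bowtie2 -x mm10 -U reads.fastq -S mapped.sam \
    --un unaligned.fastq \
    --ma 2 --mp 10,8 --np 10 \
    --rdg 10,1 --rfg 10,3
```

With these gap penalties, a 105 bp deletion alone costs 10 + 105×1 = 115 score points, which with a 150 bp read is already below Bowtie2's default end-to-end `--score-min` threshold (L,-0.6,-0.6, i.e. about -90.6 for 150 bp), so such reads can never pass the filter regardless of the other settings; relaxing `--score-min` (e.g. to something like `L,0,-1`) might be worth trying.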
I was advised to try BWA-MEM, but it basically cuts off (soft-clips) the ends of the reads where the indels start; the parts of the reads after the indels are not mapped, so it is impossible to infer anything about those indels.
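(The cut-off ends show up as soft clips, `S`, in the CIGAR field of the SAM records, so one can quantify how much of each read BWA-MEM discarded versus aligned through a deletion. A minimal pure-Python sketch - function and variable names are my own, not from any library:)

```python
import re

# Split a CIGAR string such as "120M30D20M" into (length, op) pairs.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_summary(cigar):
    """Total soft-clipped (S), inserted (I), and deleted (D) bases
    in one alignment's CIGAR string."""
    totals = {"S": 0, "I": 0, "D": 0}
    for length, op in CIGAR_RE.findall(cigar):
        if op in totals:
            totals[op] += int(length)
    return totals

# e.g. cigar_summary("100M50S") -> {'S': 50, 'I': 0, 'D': 0}
#      cigar_summary("120M30D20M") -> {'S': 0, 'I': 0, 'D': 30}
```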
Next, I am trying to detect indels in those mappings with FreeBayes. I can see in a genome viewer that there are 3-6 indels per clone at this site, but FreeBayes only detects the very short ones - it basically does not detect deletions larger than about 10 bp.
The command that I run is as follows:
```
freebayes --bam localbam_0.bam --fasta-reference /galaxy/data/mm10/sam_index/mm10.fa --vcf /galaxy-repl/main/files/012/883/dataset_12883522.dat --region chr4:148446582..148559685 --theta 0.002 --ploidy 3 -K -m 1 -q 0 -R 0 -Y 0 -e 1000 -F 0.2
```
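One knob that may matter here: FreeBayes assembles candidate haplotypes over a short window (`--haplotype-length`, default 3), so long deletions with only 10-20 bp of anchoring sequence downstream can fall outside what it considers. A hedged variant of the command above, with the haplotype window widened and the supporting-read thresholds relaxed (the specific values are illustrative guesses, not recommendations, and the long reference paths are shortened to placeholders):

```shell
# Same inputs as above, paths shortened to placeholders.
freebayes --bam localbam_0.bam \
    --fasta-reference mm10.fa \
    --region chr4:148446582..148559685 \
    --theta 0.002 --ploidy 3 --pooled-continuous \
    --haplotype-length 50 \
    --min-alternate-fraction 0.05 --min-alternate-count 2 \
    > calls.vcf
```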
I ended up using samtools mpileup to detect all mapped indels and SNPs, and writing a Python script to extract the frequencies of those indels from the resulting text files, but having FreeBayes do this for me would be much handier.
I would appreciate any suggestions on how I could improve the mapping quality and the FreeBayes indel/SNP frequency detection in my data.
Looking forward to your replies,