I have a MiSeq experiment using 24 indices; under each index I sequenced 3 PCR-amplified loci (275 bp) from triploid clones of a mouse cell line that had been subjected to a Cas9/CRISPR digest, i.e. they should contain indels. Moreover, the clones were not picked very carefully - sometimes there may actually be 2 or 3 different clones present under one index. In any case, for each of the 3 loci in each index there should be up to 9 different sequences.
The run was unpaired, 150 bp, and the indels are located 10-20 nt from the 3' end of the reads. (The run was actually paired-end 2x150 bp, but the reverse reads did not work out, which left me with unpaired sequencing from the forward primer only.)
The first, smaller problem I encountered was with the mapping - it is hard to tweak the parameters so that reads with various deletion lengths all map at the same time. There are occasional single-base substitutions, short insertions (up to 3 bp), and deletions that are usually 1-10 bp long, but occasionally also 15-30 bp or even 60 or 105 bp. I managed to set up Bowtie2 to detect substitutions, insertions, and deletions up to 30 bp, but detecting the longer ones seems impossible. They either show up in the mapping as numerous substitutions and shorter indels toward the end of the read, or are not mapped at all (dumped in the unaligned-reads file). This may be because there is very little read left after the indel (as I mentioned above, only 10-20 bp). The input parameters I use for Bowtie2 are as follows (sorry, I can't find where the resulting command line is):
| Input Parameter | Value | Note for rerun |
|---|---|---|
| Is this single or paired library | single | |
| FASTQ file | 48: FASTQ Groomer on data 24 | |
| Write unaligned reads (in fastq format) to separate file(s) | True | |
| Will you select a reference genome from your history or use a built-in index? | indexed | |
| Select reference genome | mm10 | |
| Specify the read group for this file? | no | |
| Select analysis mode | full | |
| Do you want to tweak input options? | no | |
| Do you want to tweak alignment options? | no | |
| Do you want to tweak scoring options? | yes | |
| Set the match bonus | 2 | |
| Set the maximum (`MX`) and minimum (`MN`) mismatch penalties, both integers | 10,8 | |
| Sets penalty for positions where the read, reference, or both, contain an ambiguous character such as `N` | 10 | |
| Set the read gap opening penalty | 10 | |
| Set the read gap extension penalty | 1 | |
| Set the reference gap opening penalty | 10 | |
| Set the reference gap extension penalty | 3 | |
| Do you want to use -a or -k options | no | |
| Do you want to tweak effort options? | no | |
| Do you want to tweak SAM/BAM Options? | no | |
| Do you want to tweak Other Options? | no | |
| Job Resource Parameters | no | |
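For reference, the Galaxy settings above should correspond roughly to the following Bowtie2 invocation (file names are placeholders, and the Galaxy wrapper may add further defaults that I cannot see; note also that `--ma` is only used in `--local` mode, so in end-to-end mode the match bonus would be ignored):

```shell
# Approximate command-line equivalent of the Galaxy settings above.
# reads.fastq / mapped.sam / unaligned.fastq are placeholder names.
bowtie2 -x mm10 -U reads.fastq -S mapped.sam \
    --un unaligned.fastq \
    --ma 2 --mp 10,8 --np 10 \
    --rdg 10,1 --rfg 10,3
```

With these gap penalties, a 105 bp deletion alone costs 10 + 105×1 = 115 score points, which with a 150 bp read is already below Bowtie2's default end-to-end `--score-min` threshold (L,-0.6,-0.6, i.e. about -90.6 for 150 bp), so such reads can never pass the filter regardless of the other settings; relaxing `--score-min` (e.g. to something like `L,0,-1`) might be worth trying.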
I was advised to try BWA-MEM, but it basically cuts off (soft-clips) the ends of the reads where the indels start; the parts of the reads after the indels are not mapped, so it is impossible to infer anything about those indels.
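(The cut-off ends show up as soft clips, `S`, in the CIGAR field of the SAM records, so one can quantify how much of each read BWA-MEM discarded versus aligned through a deletion. A minimal pure-Python sketch - function and variable names are my own, not from any library:)

```python
import re

# Split a CIGAR string such as "120M30D20M" into (length, op) pairs.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_summary(cigar):
    """Total soft-clipped (S), inserted (I), and deleted (D) bases
    in one alignment's CIGAR string."""
    totals = {"S": 0, "I": 0, "D": 0}
    for length, op in CIGAR_RE.findall(cigar):
        if op in totals:
            totals[op] += int(length)
    return totals

# e.g. cigar_summary("100M50S") -> {'S': 50, 'I': 0, 'D': 0}
#      cigar_summary("120M30D20M") -> {'S': 0, 'I': 0, 'D': 30}
```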
Next, I am trying to detect indels in those mappings with FreeBayes. I can see in a genome viewer that there are 3-6 indels per clone at this site, but FreeBayes only detects the very short ones - it basically does not detect deletions larger than about 10 bp.
The command that I run is as follows:
```
freebayes --bam localbam_0.bam --fasta-reference /galaxy/data/mm10/sam_index/mm10.fa --vcf /galaxy-repl/main/files/012/883/dataset_12883522.dat --region chr4:148446582..148559685 --theta 0.002 --ploidy 3 -K -m 1 -q 0 -R 0 -Y 0 -e 1000 -F 0.2
```
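One knob that may matter here: FreeBayes assembles candidate haplotypes over a short window (`--haplotype-length`, default 3), so long deletions with only 10-20 bp of anchoring sequence downstream can fall outside what it considers. A hedged variant of the command above, with the haplotype window widened and the supporting-read thresholds relaxed (the specific values are illustrative guesses, not recommendations, and the long reference paths are shortened to placeholders):

```shell
# Same inputs as above, paths shortened to placeholders.
freebayes --bam localbam_0.bam \
    --fasta-reference mm10.fa \
    --region chr4:148446582..148559685 \
    --theta 0.002 --ploidy 3 --pooled-continuous \
    --haplotype-length 50 \
    --min-alternate-fraction 0.05 --min-alternate-count 2 \
    > calls.vcf
```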
I ended up using samtools mpileup to detect all mapped indels and SNPs, and writing a Python script to extract the frequencies of those indels from the resulting text files, but having FreeBayes do this for me would be much handier.
I would appreciate any suggestions on how I could improve the mapping quality and the FreeBayes indel/SNP frequency detection in my data.
Looking forward to your replies,