Mapping paired reads to reference genome, variant calling

I am trying to map a trio of paired-end Illumina read sets to a reference genome, then identify polymorphisms that are present in all three individuals. To do this, I have prepared the datasets using FASTQ Groomer and FASTQ Joiner. Here are three questions:

1) What are the advantages/disadvantages of mapping reads to a reference genome using Bowtie2, BWA, or BWA-MEM? I see that BWA-MEM is for longer reads, but some of my reads are <100bp and others are >100bp.

2) Bowtie2 asks for interleaved FASTQ files as input; what does this mean, and is the output of FASTQ Joiner interleaved?

3) What are the advantages/disadvantages of FreeBayes vs NaiveVariantCaller?

Thanks!

naivevariantcaller bwa bowtie2 freebayes • 1.0k views

modified 15 months ago • written 15 months ago by jrberminghamjr • 10

15 months ago by Jennifer Hillman Jackson ♦ 25k (United States):

Hello,

Questions 1 & 3 are large in scope, but below I give some general advice and resources. If you need help with usage - please let us know.

1 - This Q&A includes a link to a publication and a summary of the differences between these tools: https://www.quora.com/How-is-Bowtie-different-from-BWA-BWA-MEM. There are other resources: a web search on the tool names will bring up prior Q&A, summaries, opinions, publications, and discussions of specific use cases. You might also consider mapping your own data with each tool and comparing the results. Reviewing what others have determined is a useful place to start, yet running a few tests yourself is ultimately the best way to determine the optimal tools/settings for your specific data and goals.
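One simple way to compare test runs is to look at the fraction of reads each mapper placed. The sketch below is only an illustration, assuming the alignments are available as SAM text (BAM output would need converting first); the function name `mapped_fraction` is hypothetical and not part of any tool. It relies on the SAM flag bits 0x4 (read unmapped), 0x100 (secondary alignment), and 0x800 (supplementary alignment).

```python
from typing import Iterable


def mapped_fraction(sam_lines: Iterable[str]) -> float:
    """Return the fraction of reads that are mapped in SAM-formatted text.

    Header lines (starting with '@') are skipped. Secondary and
    supplementary alignments are ignored so each read is counted once.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # second SAM column is the bitwise FLAG
        if flag & 0x900:
            continue  # 0x100 secondary or 0x800 supplementary alignment
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0


# Tiny made-up example: one mapped pair and one unmapped pair.
example_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r1\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tTTGA\tIIII",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
]

if __name__ == "__main__":
    print(mapped_fraction(example_sam))
```

Mapping rate alone does not decide which tool is better for variant calling, so treat it as one quick check alongside the published comparisons.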

2 - Bowtie2 will accept either two distinct paired fastq datasets (forward and reverse) or one interleaved fastq dataset. The select menu allows you to choose the type of input. If your data is already in two datasets, simply enter it in that format. The output from FASTQ Joiner is not the same content as an interleaved fastq dataset.

Joined = the forward and reverse sequence content is joined (merged) into a single sequence and quality string (single fastq record). All sequences that had a matched pair (based on the sequence identifier) will be included in the output from FASTQ Joiner. This would be entered into tools as an unpaired dataset. There are use cases for this type of input - but in general - if you have paired end data, it is best to input it as two matched paired-end datasets.

Interleaved = the forward and reverse sequence content is concatenated (stacked) into a single dataset with the distinct fastq records retained. The forward read will be included (all original 4 fastq lines) followed by the reverse read (all 4 original fastq lines) - for all records. Interleaved fastq data comes from the data source - it is not produced by a Galaxy tool.
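The difference between the two layouts can be sketched in a few lines of Python. This is a simplified illustration, not the code used by FASTQ Joiner or any other Galaxy tool: it assumes plain 4-line FASTQ records with mates in the same order in both inputs, and the function names are hypothetical. The real joiner may process the reverse read differently before merging.

```python
from typing import Iterable, Iterator, List, Tuple

# (header, sequence, separator, quality)
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records; assumes sequences are not line-wrapped."""
    buf: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        buf.append(line)
        if len(buf) == 4:
            yield (buf[0], buf[1], buf[2], buf[3])
            buf = []


def interleave(forward: Iterable[str], reverse: Iterable[str]) -> List[str]:
    """Interleaved: both records kept whole, forward then reverse, per pair."""
    out: List[str] = []
    for fwd, rev in zip(parse_fastq(forward), parse_fastq(reverse)):
        out.extend(fwd)
        out.extend(rev)
    return out


def join(forward: Iterable[str], reverse: Iterable[str]) -> List[str]:
    """Joined (simplified): one record per pair, sequences and qualities merged."""
    out: List[str] = []
    for fwd, rev in zip(parse_fastq(forward), parse_fastq(reverse)):
        out.extend([fwd[0], fwd[1] + rev[1], "+", fwd[3] + rev[3]])
    return out


# Made-up single read pair for illustration.
forward_reads = ["@read1/1", "ACGT", "+", "IIII"]
reverse_reads = ["@read1/2", "TTGA", "+", "FFFF"]

if __name__ == "__main__":
    print("\n".join(interleave(forward_reads, reverse_reads)))
    print("\n".join(join(forward_reads, reverse_reads)))
```

For one input pair, the interleaved output has eight lines (two complete records), while the joined output has four lines (one merged record), which is why a mapper treats joined data as unpaired.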

3 - These two tools use different algorithms to make calls and output slightly different VCF content. The NVC tool is capable of outputting all calls, including stranded calls, and the major/minor alleles can be reviewed/filtered using the tool Variant Annotator. More details are here (also linked from the tool wrapper): https://genomebiology.biomedcentral.com/articles/10.1186/gb4161. For more about FreeBayes processing details, the manual is the best place to start.

Thanks! Jen, Galaxy team

modified 15 months ago • written 15 months ago by Jennifer Hillman Jackson ♦ 25k