Stitching MAF blocks for use with phyloCSF


4.3 years ago, jsj5 • 0 (United Kingdom) wrote:

If I've understood correctly, when using Galaxy's Stitch MAF blocks tool, the resulting FASTA file uses gap characters to represent genomic regions for which no alignment block was present. These gap characters therefore appear in both the target/reference sequence and the query sequences used to create the multiple alignment.

For phyloCSF to work properly, the reference sequence must be un-gapped (see https://github.com/mlin/PhyloCSF/wiki). There is a phyloCSF option to remove gaps common to all sequences; however, this would cause frame-shifts in the reference sequence that would invalidate the analysis.

Given that people frequently use Stitch MAF blocks to generate multi-fasta files for phyloCSF analysis, is there a convenient way to create fasta files in which the reference sequence does not contain gaps?

I understand that it may be possible to run the underlying Galaxy code (e.g. get_spliced_region_alignment()) and customize the resulting FASTA but, given that Galaxy is aimed at users without programming experience, I wonder whether there is an easier way to overcome this problem.

Thanks in advance.

galaxy • 1.6k views


modified 4.3 years ago by Jennifer Hillman Jackson ♦ 25k • written 4.3 years ago by jsj5 • 0

4.3 years ago, Jennifer Hillman Jackson ♦ 25k (United States) wrote:

Hello,

Unless the species are highly conserved and each genome is close to finished (rather than draft), gaps are inevitable, even among conserved transcript exons. Depending on the interval set you are working from, some variation between species is expected even when the genomes are closely related and finished. If possible, use "Stitch Gene blocks" to reduce the noise.

It sounds like the Stitch MAF option "Split into Gapless MAF blocks: Yes" would perform about the same operation as removing all gaps with the other tool (PhyloCSF). This will introduce unknown frames and fragmentation, but no gaps. PhyloCSF has an option to explore all three frames, from which the best can be selected. It also has an option to permit gaps in the reference.
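For anyone comfortable running a small script outside Galaxy, the gap handling discussed above can be sketched in Python. This is only an illustration, not Galaxy or PhyloCSF code: it assumes the stitched FASTA lists the reference sequence first and uses "-" as the gap character, and the function names are made up for this example. It drops every alignment column in which the reference has a gap and, separately, lists where the reference gap runs are so they can be inspected.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_reference_gap_columns(records, gap="-"):
    """Remove every column in which the first (assumed reference) row is a gap."""
    ref = records[0][1]
    if any(len(seq) != len(ref) for _, seq in records):
        raise ValueError("all rows must have the same aligned length")
    keep = [i for i, base in enumerate(ref) if base != gap]
    return [(name, "".join(seq[i] for i in keep)) for name, seq in records]


def reference_gap_runs(records, gap="-"):
    """Return (start, length) for each run of consecutive gaps in the reference row."""
    runs, start = [], None
    ref = records[0][1]
    for i, base in enumerate(ref + "N"):  # sentinel closes a trailing run
        if base == gap and start is None:
            start = i
        elif base != gap and start is not None:
            runs.append((start, i - start))
            start = None
    return runs
```

If reference gaps really do stand in for regions with no alignment block, as the question suggests, any run whose length is not a multiple of three is worth checking by hand before trusting the reading frame of the ungapped result.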

Including fewer species in a single run will reduce the frequency of gaps, because a gap is propagated throughout the MAF whenever any species has an inserted or deleted region (one or more bases). If a particular reference genome is problematic or simply more divergent than the others, it might be worth testing it as a paired set and leaving it out of the grouped analysis.

There are more options for manipulating MAFs and, in particular, FASTA files (such as "Concatenate FASTA alignment by species"), but you may still introduce frame-shifts with these, in particular with the aligned genomes. Stitch Gene blocks -> Gapless MAF blocks -> Fasta -> Concatenate is probably the best place to start, but test the result.
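To make the final "Concatenate" step concrete, here is a rough Python sketch of joining per-block alignments into one sequence per species. It is not the Galaxy tool's implementation. It assumes each block has already been read into a dict mapping a species name to its aligned sequence, and padding absent species with gaps is a choice made for this example; the function name is hypothetical.

```python
def concatenate_by_species(blocks, gap="-"):
    """Join per-block alignments into one aligned sequence per species.

    blocks: list of dicts mapping species name -> aligned sequence.
    A species missing from a block is padded with gap characters
    (an assumption of this sketch) so every row keeps the same length.
    """
    # Collect species in order of first appearance.
    species = []
    for block in blocks:
        for name in block:
            if name not in species:
                species.append(name)

    joined = {name: [] for name in species}
    for block in blocks:
        lengths = {len(seq) for seq in block.values()}
        if len(lengths) != 1:
            raise ValueError("rows within a block must share one aligned length")
        width = lengths.pop()
        for name in species:
            joined[name].append(block.get(name, gap * width))
    return {name: "".join(parts) for name, parts in joined.items()}
```

Note that padding a species with gaps keeps the rows aligned but does nothing about the reading frame, so the caveat about frame-shifts still applies to the concatenated output.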

Others are welcome to post their experiences with this tool.

Jen, Galaxy team

written 4.3 years ago by Jennifer Hillman Jackson ♦ 25k
