Algebraic tools for evolutionary biology (Q1950028)

From MaRDI portal
scientific article

    Statements

    Algebraic tools for evolutionary biology (English)
    23 May 2013
    The purpose of this article is to provide an overview of the uses of algebraic statistics and algebraic geometry in studying problems arising in evolutionary biology, particularly phylogenetics. The aim of phylogenetics is to `reconstruct the ancestral relations among species, i.e., the phylogenetic tree, from given DNA sequences.' A phylogenetic tree is an acyclic connected graph \(T\) together with: (1) a labelling of the leaves of \(T\) by contemporary species names (each recorded as a DNA sequence) and (2) a length for every edge, called a branch length, representing the evolutionary distance between the two ends of the edge. The topology of a phylogenetic tree is the topology of the labelled tree (disregarding branch lengths). Hence a phylogenetic tree is determined by its topology and branch lengths.
    Given a group of contemporary species, there are three key questions to answer when reconstructing the phylogenetic tree: (1) What evolutionary model is best suited to the given DNA sequences? (2) What tree topology best fits the data and what are its branch lengths? (3) Given the choice of model, is it possible to identify the phylogenetic tree from the given DNA sequences?
    Evolutionary models are presented in \S~2 as discrete-time Markov processes on trees. Choosing a model (question one above) amounts to placing restrictions on the general Markov model [\textit{D. Barry} and \textit{J. A. Hartigan}, Biometrics 43, 261--276 (1987; Zbl 0622.92012)]; [\textit{M. Steel}, Appl. Math. Lett. 7, No. 2, 19--23 (1994; Zbl 0794.60071)]. For instance, in this way one recovers models such as the Strand symmetric model [\textit{M. Casanellas} and \textit{S. Sullivant}, ``The Strand Symmetric Model,'' in: L. Pachter and B. Sturmfels, editors, Algebraic statistics for computational biology, chapter 16, Cambridge University Press (2005)], the Kimura \(3\)-parameter model [\textit{M. Kimura}, Proc. Natl. Acad. Sci. USA 78, 454--458 (1981; Zbl 0511.92013)], and the Jukes-Cantor model [\textit{T. H. Jukes} and \textit{C. R. Cantor}, ``Evolution of protein molecules,'' in: Mammalian Protein Metabolism, 21--132 (1969)].
    Given a choice of evolutionary model, the author describes in \S~4 three methods that are used to reconstruct the phylogenetic trees which are `best' given the data and the choice of model. The first of these is maximum likelihood estimation. Under this method, the maximum likelihood estimate of the parameters of the model \(\mathcal{M}\) is obtained for each choice of tree topology \(T\), and then a tree topology \(T\) is chosen to maximize the likelihood among all tree topologies. The drawbacks of this method are that numerical methods may fail to find a global maximum and that a vast number of tree topologies must be checked (for \(n\) leaves there are \((2n-5)!!\) unrooted binary tree topologies). The second method is neighbor-joining, which is a distance-based method. This method reconstructs the phylogenetic tree from a dissimilarity function \(d\) which is chosen based on the model \(\mathcal{M}\). The tree \(T\) constructed by this method is such that \(d\) mimics the path length function between leaves of \(T\). Since there is no need to search through all tree topologies, this is a fast algorithm. The drawback is that evolutionary distance often does not have an interpretation as path distance in the phylogenetic tree, so the constructed tree may be biological nonsense.
    The third method of reconstruction is based on invariants. It is in this method of reconstruction that techniques from commutative algebra and algebraic geometry are most useful. The author describes invariants, and in particular phylogenetic invariants, in \S~3. Let \(\mathcal{M}\) be a model with \(d\) free parameters on the tree topology \(T\), where \(T\) has \(n\) leaves. There is a map \(\phi^{\mathcal{M}}_T: \mathbb{R}^d\rightarrow \mathbb{R}^{4^n}\), where each vector of parameters is sent to the vector whose \(4^n\) entries record the probabilities of each possible pattern of nucleotides occurring at the leaves of \(T\). Any polynomial in the ideal \(I_\mathcal{M}(T)\) of polynomials vanishing on the image of \(\phi^{\mathcal{M}}_T\) is called an invariant of \(T\). These invariants record relations among the probabilities for the model \(\mathcal{M}\). If a polynomial \(f\) is in \(I_\mathcal{M}(T)\) but not in \(I_\mathcal{M}(T')\) for any other tree topology \(T'\), then \(f\) is a phylogenetic invariant. The invariants of \(T\) may also be used to reconstruct the phylogenetic tree. Such methods may be found in [\textit{M. Casanellas} and \textit{J. Fernandez-Sanchez}, ``Performance of a new invariants method on homogeneous and nonhomogeneous quartet trees,'' Mol. Biol. Evol. 24, No. 1, 288--293 (2007), ``Reconstrucción filogenética usando geometría algebraica,'' Arbor. Ciencia, pensamiento, cultura 96, 207--229 (2010); \textit{N. Eriksson}, ``Tree construction using singular value decomposition,'' in: Algebraic Statistics for Computational Biology, chapter 19, Cambridge University Press (2005)]. While the results obtained using invariants are slightly worse than those obtained using maximum likelihood or neighbor-joining, the author notes that these methods apply to much more general evolutionary models; in particular, they work well on non-homogeneous models.
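    The role of invariants in distinguishing tree topologies can be illustrated on a quartet tree. The following is a minimal sketch, not taken from the article under review: it assumes the general Markov model on the quartet tree with split 12|34, evaluates the parametrization map at randomly chosen parameters, and uses the standard fact that the \(16\times 16\) flattening of the leaf distribution along the true split has rank at most \(4\) (so that its \(5\times 5\) minors are invariants), whereas the flattening along a wrong split generically has full rank. All function names, the uniform root distribution, and the diagonally dominant transition matrices are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product
import random

STATES = range(4)  # the four nucleotides A, C, G, T


def random_transition_matrix(rng):
    """Row-stochastic 4x4 matrix with a dominant diagonal (illustrative choice).

    Strict diagonal dominance guarantees that the matrix is nonsingular.
    """
    rows = []
    for i in STATES:
        row = [Fraction(rng.randint(1, 3)) for _ in STATES]
        row[i] = Fraction(rng.randint(10, 20))
        total = sum(row)
        rows.append([entry / total for entry in row])
    return rows


def quartet_distribution(pi, m1, m2, m_int, m3, m4):
    """Joint distribution at the leaves of the quartet tree 12|34.

    The tree is rooted at the internal node adjacent to leaves 1 and 2;
    m_int is the transition matrix on the internal edge.
    """
    p = {}
    for i1, i2, i3, i4 in product(STATES, repeat=4):
        p[i1, i2, i3, i4] = sum(
            pi[x] * m1[x][i1] * m2[x][i2] * m_int[x][y] * m3[y][i3] * m4[y][i4]
            for x in STATES
            for y in STATES
        )
    return p


def flattening(p, split):
    """16x16 matrix with rows indexed by the states at the leaves in split[0]
    and columns indexed by the states at the leaves in split[1]."""
    (a, b), (c, d) = split
    pairs = list(product(STATES, repeat=2))
    matrix = []
    for row_a, row_b in pairs:
        row = []
        for col_c, col_d in pairs:
            pattern = [None] * 4
            pattern[a], pattern[b], pattern[c], pattern[d] = row_a, row_b, col_c, col_d
            row.append(p[tuple(pattern)])
        matrix.append(row)
    return matrix


def rank(matrix):
    """Exact rank by Gaussian elimination over the rationals."""
    rows = [row[:] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


rng = random.Random(0)
pi = [Fraction(1, 4)] * 4  # uniform root distribution (assumption)
m1, m2, m_int, m3, m4 = (random_transition_matrix(rng) for _ in range(5))
p = quartet_distribution(pi, m1, m2, m_int, m3, m4)

rank_true = rank(flattening(p, ((0, 1), (2, 3))))   # split 12|34 of the tree
rank_other = rank(flattening(p, ((0, 2), (1, 3))))  # split 13|24, not in the tree
print(rank_true, rank_other)
```

    Exact rational arithmetic is used so that the rank computation is not affected by rounding. With empirical pattern frequencies in place of exact probabilities the rank condition only holds approximately, which is why methods of this kind measure the distance to the nearest rank-\(4\) matrix, for instance via singular values.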
    algebraic statistics
    phylogenetics
    phylogenetic tree
    phylogenetic invariant
    algebraic variety
    Markov model
    Markov process
    maximum likelihood
    neighbor-joining
