Learning nonsingular phylogenies and hidden Markov models (Q5901107)

The aim of the present paper is to study and to provide new interesting results on the learning problem for phylogenies and hidden Markov models. Phylogenies are used in evolutionary biology to model the stochastic evolution of genetic data on the ancestral tree relating a group of species. The underlying assumption is that genetic information evolves from the root to the leaves according to a Markov model on the tree. In molecular biology, the problem of reconstructing the phylogenetic trees is to obtain the topology of the (unknown) tree from the characters (sequences) at the tree leaves (extant species). The present paper pays a special attention to the problem of inferring the complete Markov model, i.e. inferring the Markov matrices on the edges. The authors looked for the design of efficient reconstruction algorithms of phylogenies, based on learning techniques. A Markov model is called nonsingular is all its transition matrices have determinants bounded away from 0 (and 1). The role of the nonsingularity condition is pointed out for the learning approach. In learning theory, the source of hardness for tree topologies and hidden Markov models are the transition matrices \(P\) for which \(\det(P)\) is 0 (or close to 0), but \(\text{rank}(P) > 1\). When the number of character states is \(k = 2,\) there are no matrices \(P\) whose determinant is 0 and rank is more than 1, thus efficient reconstruction algorithms of phylogenies and hidden Markov models do exist for this case. The main contribution of this paper is to investigate the problem of learning hidden Markov models without the nonsingularity condition, for \(k > 2\), and proper polynomial-time (PAC) algorithm for learning nonsingular phylogenies and hidden Markov models under the weaker condition that \(1 / \text{poly}(n)< |\det(P)| \leq 1\). The obtained class of learning algorithms is based on a combination of techniques from phylogeny, statistics, combinatorics, and probability. Several known results are shown to be particular or closely related cases, including the recovery of the mutation matrices from an infinite number of samples. The reconstruction of the mutation matrices from a polynomial number of samples is proved to require a refined error analysis along with various combinatorial and algorithmic ideas.

0 references

reviewed by

Neculai Curteanu

0 references

zbMATH Keywords

efficient reconstruction algorithms

0 references

problem of reconstructing the phylogenetic trees

0 references

hidden Markov models

0 references

evolutionary trees