Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling (Q2299336)

From MaRDI portal
scientific article

    Statements

    Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling (English)
    21 February 2020
    In the present paper, the authors are interested in statistically consistent estimators for the ancestral sequence reconstruction (ASR) problem under the TKF91 process in the taxon-rich setting, which differs from the ``solvability'' results in [\textit{A. Andoni} et al., Stochastic Processes Appl. 122, No. 12, 3852--3874 (2012; Zbl 1250.92034)]. In fact, an ASR statistical consistency result in this context is already implied by the general results of [\textit{W.-T. Fan} and \textit{S. Roch}, Electron. J. Probab. 23, Paper No. 47, 24 p. (2018; Zbl 1410.60074)]. More concretely, they consider the ASR problem in the taxon-rich context for the TKF91 process. It is known from previous work [Zbl 1410.60074, Theorem 1] that the Big Bang condition is necessary for the existence of consistent estimators. In this paper, the authors design the first estimator that is not only consistent but also explicit and computationally tractable. Their ancestral reconstruction algorithm involves two steps: first the length of the ancestral sequence is estimated, and then the nucleotides are estimated conditional on the sequence length. The novel observation that leads to the design of the authors' estimator is a new constructive proof of initial-state identifiability, formulated in Lemma 2, which says that one can explicitly invert the mapping from the root sequence to the distribution of the leaf sequences. This is nontrivial for evolutionary models with indels. The estimator is computationally efficient in the sense that the number of arithmetic operations required scales polynomially in the size of the input data. Indeed, the running time of the length estimator is linear in the number of input sequences, and the matrix manipulations in the sequence estimator are polynomial in the length of the longest input sequence.
    phylogenetics
    ancestral reconstruction
    insertion/deletions
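
The first step described in the review, estimating the ancestral sequence length, can be illustrated with a minimal method-of-moments sketch. It is not taken from the paper and may differ from the authors' estimator. It assumes known TKF91 insertion and deletion rates with `lam < mu`, and that every leaf lies at the same distance `height` from the root. Under TKF91 the sequence length is a linear birth-death process with immigration, so its mean at time t, started from length M, is M e^{-(mu-lam)t} + (lam/(mu-lam))(1 - e^{-(mu-lam)t}). The sketch inverts this formula at the average leaf length. The function names are hypothetical.

```python
import math


def expected_length(root_length, lam, mu, t):
    """Mean TKF91 sequence length at time t, starting from root_length.

    Assumes lam < mu (insertion rate below deletion rate).
    """
    gamma = mu - lam
    decay = math.exp(-gamma * t)
    return root_length * decay + (lam / gamma) * (1.0 - decay)


def estimate_root_length(leaf_lengths, lam, mu, height):
    """Method-of-moments estimate of the root length (illustrative only).

    Assumes all leaves are at distance `height` from the root.
    The average is computed in a single pass, so the cost is linear
    in the number of input sequences.
    """
    gamma = mu - lam
    mean_len = sum(leaf_lengths) / len(leaf_lengths)
    # Invert expected_length: solve mean_len = M*decay + (lam/gamma)*(1 - decay) for M.
    return (mean_len - lam / gamma) * math.exp(gamma * height) + lam / gamma
```

If the leaf lengths all equal the exact mean returned by `expected_length`, the estimate recovers the root length up to floating-point error. With real data the average fluctuates, and consistency as the number of leaves grows depends on conditions such as the Big Bang condition mentioned in the review, which this sketch does not check.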

    Identifiers