On overabundant words and their application to biological sequence analysis (Q2326388)

From MaRDI portal
scientific article
Language Label Description Also known as
English
On overabundant words and their application to biological sequence analysis
scientific article

    Statements

    On overabundant words and their application to biological sequence analysis (English)
    0 references
    0 references
    0 references
    0 references
    0 references
    0 references
    7 October 2019
    0 references
    The authors use the biologically justified model introduced by \textit{V. Brendel} et al. [``Linguistics of nucleotide sequences: morphology and comparison of vocabularies'', J. Biomol. Struct. Dyn. 4, No. 1, 11--21 (1986; \url{doi:10.1080/07391102.1986.10507643})]. They prove non-trivial combinatorial properties and establish that it admits efficient computation for overabundant words as well. The results in this article extend their former study by presenting an \(O(n)\)-time and \(O(n)\)-space algorithm for computing all overabundant words in a sequence \(x\) of length \(n\) over an integer alphabet. They also present experimental results, using both synthetic and real data, which further highlight the effectiveness of this model. The computational problem can be described as follows. Given a sequence \(x\) of length \(n\) and a real number \(\rho >0\), compute the set of \(\rho\)-overabundant words, i.e., all words w for which \(dev( w)\ge \rho\), by presenting an \(O(n)\)-time and \(O(n)\)-space algorithm for computing all \(\rho\)-overabundant words (of any length) in a sequence \(x\) of length \(n\) over an integer alphabet. This result is based on a combinatorial property of the suffix tree \(T\) of \(x\) that are proved here: the number of distinct factors of \(x\) whose longest infix is the label of an explicit node of \(T\) is no more than \(3n-4\). Further is demonstrated that the presented algorithm is time-optimal by proving that \(O(n)\) is a tight upper bound for the number of \(\rho\)-overabundant words.
    0 references
    0 references
    overabundant words
    0 references
    avoided words
    0 references
    pattern matching
    0 references
    suffix tree
    0 references
    DNA sequence analysis
    0 references
    0 references