On overabundant words and their application to biological sequence analysis (Q2326388)

From MaRDI portal





scientific article; zbMATH DE number 7114316
Language Label Description Also known as
default for all languages
No label defined
    English
    On overabundant words and their application to biological sequence analysis
    scientific article; zbMATH DE number 7114316

      Statements

      On overabundant words and their application to biological sequence analysis (English)
      0 references
      0 references
      0 references
      0 references
      0 references
      0 references
      7 October 2019
      0 references
      The authors use the biologically justified model introduced by \textit{V. Brendel} et al. [``Linguistics of nucleotide sequences: morphology and comparison of vocabularies'', J. Biomol. Struct. Dyn. 4, No. 1, 11--21 (1986; \url{doi:10.1080/07391102.1986.10507643})]. They prove non-trivial combinatorial properties and establish that it admits efficient computation for overabundant words as well. The results in this article extend their former study by presenting an \(O(n)\)-time and \(O(n)\)-space algorithm for computing all overabundant words in a sequence \(x\) of length \(n\) over an integer alphabet. They also present experimental results, using both synthetic and real data, which further highlight the effectiveness of this model. The computational problem can be described as follows. Given a sequence \(x\) of length \(n\) and a real number \(\rho >0\), compute the set of \(\rho\)-overabundant words, i.e., all words w for which \(dev( w)\ge \rho\), by presenting an \(O(n)\)-time and \(O(n)\)-space algorithm for computing all \(\rho\)-overabundant words (of any length) in a sequence \(x\) of length \(n\) over an integer alphabet. This result is based on a combinatorial property of the suffix tree \(T\) of \(x\) that are proved here: the number of distinct factors of \(x\) whose longest infix is the label of an explicit node of \(T\) is no more than \(3n-4\). Further is demonstrated that the presented algorithm is time-optimal by proving that \(O(n)\) is a tight upper bound for the number of \(\rho\)-overabundant words.
      0 references
      overabundant words
      0 references
      avoided words
      0 references
      pattern matching
      0 references
      suffix tree
      0 references
      DNA sequence analysis
      0 references

      Identifiers