On overabundant words and their application to biological sequence analysis (Q2326388)
From MaRDI portal
scientific article
Language | Label | Description | Also known as |
---|---|---|---|
English | On overabundant words and their application to biological sequence analysis |
scientific article |
Statements
On overabundant words and their application to biological sequence analysis (English)
0 references
7 October 2019
0 references
The authors use the biologically justified model introduced by \textit{V. Brendel} et al. [``Linguistics of nucleotide sequences: morphology and comparison of vocabularies'', J. Biomol. Struct. Dyn. 4, No. 1, 11--21 (1986; \url{doi:10.1080/07391102.1986.10507643})]. They prove non-trivial combinatorial properties and establish that it admits efficient computation for overabundant words as well. The results in this article extend their former study by presenting an \(O(n)\)-time and \(O(n)\)-space algorithm for computing all overabundant words in a sequence \(x\) of length \(n\) over an integer alphabet. They also present experimental results, using both synthetic and real data, which further highlight the effectiveness of this model. The computational problem can be described as follows. Given a sequence \(x\) of length \(n\) and a real number \(\rho >0\), compute the set of \(\rho\)-overabundant words, i.e., all words w for which \(dev( w)\ge \rho\), by presenting an \(O(n)\)-time and \(O(n)\)-space algorithm for computing all \(\rho\)-overabundant words (of any length) in a sequence \(x\) of length \(n\) over an integer alphabet. This result is based on a combinatorial property of the suffix tree \(T\) of \(x\) that are proved here: the number of distinct factors of \(x\) whose longest infix is the label of an explicit node of \(T\) is no more than \(3n-4\). Further is demonstrated that the presented algorithm is time-optimal by proving that \(O(n)\) is a tight upper bound for the number of \(\rho\)-overabundant words.
0 references
overabundant words
0 references
avoided words
0 references
pattern matching
0 references
suffix tree
0 references
DNA sequence analysis
0 references