Optimal computation of overabundant words
From MaRDI portal
Publication:5111793
DOI10.4230/LIPICS.WABI.2017.4zbMATH Open1443.92133arXiv1705.03385OpenAlexW2963163208MaRDI QIDQ5111793FDOQ5111793
Costas S. Iliopoulos, Yannis Almirantis, Dimitris Polychronopoulos, Manal Mohamed, Jia Gao, Solon P. Pissis, Panagiotis Charalampopoulos
Publication date: 27 May 2020
Abstract: The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word in a given sequence can be used for classifying as avoided or overabundant. The definitions used for the expectation and deviation of in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017). In this article, we extend this study by presenting an -time and -space algorithm for computing all overabundant words in a sequence of length over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree of : the number of distinct factors of whose longest infix is the label of an explicit node of is no more than . We further show that the presented algorithm is time-optimal by proving that is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.
Full work available at URL: https://arxiv.org/abs/1705.03385
Recommendations
- On overabundant words and their application to biological sequence analysis
- Computing maximal-exponent factors in an overlap-free word
- Optimal computation of avoided words
- Constructing antidictionaries of long texts in output-sensitive space
- Efficient computation of shortest absent words in a genomic sequence
Protein sequences, DNA sequences (92D20) Computational methods for problems pertaining to biology (92-08)
Cites Work
Cited In (2)
This page was built for publication: Optimal computation of overabundant words
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q5111793)