Optimal computation of avoided words
From MaRDI portal
Abstract: The deviation of the observed frequency of a word from its expected frequency in a given sequence is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the standard deviation of , denoted by , effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word of length is a -avoided word in if , for a given threshold . Notice that such a word may be completely absent from . Hence computing all such words na"{i}vely can be a very time-consuming procedure, in particular for large . In this article, we propose an -time and -space algorithm to compute all -avoided words of length in a given sequence of length over a fixed-sized alphabet. We also present a time-optimal -time and -space algorithm to compute all -avoided words (of any length) in a sequence of length over an alphabet of size . Furthermore, we provide a tight asymptotic upper bound for the number of -avoided words and the expected length of the longest one. We make available an open-source implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency of our implementation.
Recommendations
Cited in
(2)
This page was built for publication: Optimal computation of avoided words
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q1708407)