The complexity of the co-occurrence problem

From MaRDI portal
Publication:6166971

DOI10.1007/978-3-031-20643-6_4zbMATH Open1525.68034arXiv2206.10383OpenAlexW4312888971MaRDI QIDQ6166971FDOQ6166971

Tord Stordalen, Inge Li GΓΈrtz, Philip Bille

Publication date: 4 August 2023

Published in: String Processing and Information Retrieval (Search for Journal in Brave)

Abstract: Let S be a string of length n over an alphabet Sigma and let Q be a subset of Sigma of size qgeq2. The 'co-occurrence problem' is to construct a compact data structure that supports the following query: given an integer w return the number of length-w substrings of S that contain each character of Q at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an O(sqrtnq) space data structure that -- with some minor additions -- supports queries in O(loglogn) time [CPM 2021]. Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter d, giving a simple data structure that uses O(d) space and supports queries in O(loglogn) time. The preprocessing algorithm does a single pass over S, runs in expected O(n) time, and uses O(d) space in addition to the input. Furthermore, we show that O(d) space is optimal and that O(loglogn)-time queries are optimal given optimal space. Secondly, we bound d=O(sqrtnq), giving clean bounds in terms of n and q that match the state of the art. Furthermore, we prove that Omega(sqrtnq) bits of space is necessary in the worst case, meaning that the O(sqrtnq) upper bound is tight to within polylogarithmic factors. All of our results are based on simple and intuitive combinatorial ideas that simplify the state of the art.


Full work available at URL: https://arxiv.org/abs/2206.10383





Cites Work


Cited In (1)


Recommendations





This page was built for publication: The complexity of the co-occurrence problem

Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q6166971)