Toward a Definitive Compressibility Measure for Repetitive Sequences
From MaRDI portal
Publication:6153652
DOI10.1109/TIT.2022.3224382arXiv1910.02151OpenAlexW3125235775MaRDI QIDQ6153652FDOQ6153652
Authors: Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza
Publication date: 19 March 2024
Published in: IEEE Transactions on Information Theory (Search for Journal in Brave)
Abstract: Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size of the Lempel--Ziv parse are frequently used to estimate it. The size of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute and it is not monotonic upon symbol appends. Recently, a more principled measure, the size of the smallest string emph{attractor}, was introduced. The measure lower bounds all the previous relevant ones, yet length- strings can be represented and efficiently indexed within space , which also upper bounds most measures. While is certainly a better measure of repetitiveness than , it is also NP-complete to compute and not monotonic, and it is unknown if one can always represent a string in space. In this paper, we study an even smaller measure, , which can be computed in linear time, is monotonic, and allows encoding every string in space because . We show that better captures the compressibility of repetitive strings. Concretely, we show that (1) can be strictly smaller than , by up to a logarithmic factor; (2) there are string families needing space to be encoded, so this space is optimal for every and ; (3) one can build run-length context-free grammars of size , whereas the smallest (non-run-length) grammar can be up to times larger; and (4) within space we can not only...
Full work available at URL: https://arxiv.org/abs/1910.02151
Cited In (9)
- New string attractor-based complexities for infinite words
- Internal pattern matching queries in a text and applications
- String attractors of some simple-parry automatic sequences
- Compressibility measures for two-dimensional data
- Frequency-constrained substring complexity
- Non-overlapping indexing in BWT-runs bounded space
- Iterated straight-line programs
- Space-efficient conversions from SLPs
- Near-optimal search time in \(\delta \)-optimal space, and vice versa
This page was built for publication: Toward a Definitive Compressibility Measure for Repetitive Sequences
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q6153652)