Universal compressed text indexing
From MaRDI portal
Abstract: The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let be the size of a string attractor for a text of length . Our index takes words of space and supports locating the occurrences of any pattern of length in time, for any constant . This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.
Recommendations
Cites work
- scientific article; zbMATH DE number 6850405 (Why is no real title available?)
- scientific article; zbMATH DE number 2119724 (Why is no real title available?)
- A \textit{really} simple approximation of smallest grammar
- A faster grammar-based self-index
- A fully linear-time approximation algorithm for grammar-based compression
- A self-index on block trees
- Access, rank, and select in grammar-compressed strings
- Application of Lempel-Ziv factorization to the approximation of grammar-based compression.
- Approximation of grammar-based compression via recompression
- At the roots of dictionary compression: string attractors
- Collage system: A unifying framework for compressed pattern matching.
- Complete inverted files for efficient text retrieval and analysis
- Composite repetition-aware data structures
- Constructing Efficient Dictionaries in Close to Sorting Time
- Construction of Aho Corasick automaton in linear time for integer alphabets
- Data compression via textual substitution
- Deterministic sorting in O(nloglogn) time and linear space
- Efficient randomized pattern-matching algorithms
- Efficient string matching
- Examining Computational Geometry, Van Emde Boas Trees, and Hashing from the Perspective of the Fusion Tree
- Fast label extraction in the CDAWG
- Fast prefix search in little space, with applications
- Fully dynamic data structure for LCE queries in compressed space
- Grammar-based codes: a new class of universal lossless source codes
- Indexing compressed text
- LZ77-based self-indexing with faster pattern matching
- Large alphabets and incompressibility
- Linear work suffix array construction
- Longest common extensions in sublinear space
- On compressing and indexing repetitive sequences
- On the Complexity of Finite Sequences
- On the approximation ratio of Lempel-Ziv parsing
- Orthogonal range searching on the RAM, revisited
- Range predecessor and Lempel-Ziv parsing
- Self-indexed grammar-based compression
- Space-efficient preprocessing schemes for range minimum queries on static arrays
- Sparse suffix tree construction in optimal time and space
- Suffix Arrays: A New Method for On-Line String Searches
- The Smallest Grammar Problem
- Time-space trade-offs for Lempel-Ziv compressed indexing
- Time-space trade-offs for predecessor search
Cited in
(16)- Block trees
- Document listing on repetitive collections with guaranteed performance
- Grammar-compressed indexes with logarithmic search time
- Optimal rank and select queries on dictionary-compressed text
- A compressed dynamic self-index for highly repetitive text collections
- Top tree compression of tries
- Substring complexities on run-length compressed strings
- Sensitivity of string compressors and repetitiveness measures
- Compressed text indexing with wildcards
- String attractors of some simple-parry automatic sequences
- A combinatorial view on string attractors
- A simple grammar-based index for finding approximately longest common substrings
- Compressibility measures for two-dimensional data
- Space-efficient conversions from SLPs
- Sparse suffix and LCP array: simple, direct, small, and fast
- Wheeler maps
This page was built for publication: Universal compressed text indexing
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q1729689)