Generalized substring compression (Q2437745): Difference between revisions

For the substring-compression problem, we are asked to preprocess a string $S [1..n]$ such that later, given $i$ and $j$ with $1 \leq i \leq j \leq n$, we can quickly return the output of some pre-specified compression algorithm on $S [i..j]$; in this paper the algorithm is LZ77 [\textit{J. Ziv} and \textit{A. Lempel}, IEEE Trans. Inf. Theory 23, 337--343 (1977; Zbl 0379.94010)]. For the generalized substring-compression problem, we are asked to preprocess $S$ such that later, given $i$, $j$, $\alpha$ and $\beta$ with $1 \leq \alpha \leq \beta \leq n$, we can quickly return the parse of $S [i..j]$ that LZ77 would generate when already given $S [\alpha..\beta]$ as a context. That is, we return the suffix of the output of LZ77 on $S [\alpha..\beta] \$ S [i..j]$, where \$ is a special character not occurring in $S$, that is generated while LZ77 is processing $S [i..j]$. \textit{G. Cormode} and \textit{S. Muthukrishnan} [``Substring compression problems'', in: Proceedings of the sixteenth annual ACM-SIAM symposium on discrete algorithms, SODA 2005. New York, NY: Association for Computing Machinery (2005)] introduced and studied these problems. Via a reduction to range reporting, they gave an $O (n \log^\epsilon n)$-space data structure for substring compression with $O (C (i, j) \log n \log \log n)$ query time, where $C (i, j)$ is the number of phrases in the LZ77 parse of $S [i..j]$. They also gave a data structure for generalized substring compression, but it was faulty. In this paper, the authors improve Cormode and Muthukrishnan's result for substring compression, via a reduction to the range-successor problem [\textit{H.-P. Lenhof} and \textit{M. Smid}, RAIRO, Inform. Théor. Appl. 28, No. 1, 25--49 (1994; Zbl 0998.68520)], and give the first correct data structure for generalized substring compression, via a reduction to range emptiness. They achieve various time-space tradeoffs by choosing the data structures for range successor and range emptiness appropriately. For example, they obtain linear-space data structures with $O (C (i, j) \log^\epsilon n)$ and $O \left( C_{\alpha, \beta} (i, j) \log \left( \frac{j - i}{C_{\alpha, \beta} (i, j)} \right) \log^\epsilon n \right)$ query times, respectively, for substring compression and generalized substring compression. Suppose we have already encoded $S [i..k - 1]$ and now we want to find the next phrase in the LZ77 parse of $S [i..j]$. For substring compression, we should find the length of the longest common prefix (LCP) of $S [k..j]$ and any of $S [i..n], \dots, S [k - 1..n]$. To be able to do this quickly, we represent the suffixes as points on a grid: if $S [y..n]$ is lexicographically $x$th among the suffixes of $S$, then we represent it as the point $(x, y)$. Finding the length of the LCP of $S [k..j]$ and any of $S [i..n], \dots, S [k - 1..n]$ that are lexicographically less (or greater) than $S [k..j]$, is equivalent to finding the rightmost (or leftmost) point whose $y$-coordinate is between $i$ and $k - 1$ and whose $x$-coordinate is less (or greater) than that of the point with $y$-coordinate $k$. For generalized substring compression, we should find the length of the LCP of $S [k..j]$ and any of $S [\alpha..\beta], S [\alpha + 1..\beta], \dots, S [\beta], S [i..n], \dots, S [k - 1..n]$. Let $x_{\min}$ and $x_{\max}$ be the lexicographic ranks of the lexicographically smallest and largest suffixes of $S$ that share prefixes of length at least $\ell \leq j - k + 1$ with $S [k..j]$; these can be computed quickly using, e.g., an augmented suffix tree. Determining whether the LCP of $S [k..j]$ and $S [\alpha..\beta], S [\alpha + 1..\beta], \dots, S [\beta]$ is at least $\ell$ is equivalent to determining whether there is a point in the rectangle $[x_{\min}, x_{\max}] \times [\alpha, \beta]$. Thus, generalized substring compression can be solved using a search with a range-emptiness query at each step.

0 references

reviewed by

Travis Gagie

0 references

zbMATH Keywords

data compression

0 references

Lempel-Ziv compression

0 references

suffix tree

0 references

range searching

0 references

describes a project that uses

BEETL

0 references

LZ77

0 references

MaRDI profile type

MaRDI publication profile

0 references

full work available at URL

https://doi.org/10.1016/j.tcs.2013.10.010

0 references

cites work

Text Indexing and Dictionary Matching with One Error

0 references

The level ancestor problem simplified

0 references

Substring Range Reporting

0 references

Orthogonal range searching on the RAM, revisited

0 references

Approximating the smallest grammar

0 references

Q2921685

0 references

Comparing DNA Sequence Collections by Direct Comparison of Compressed Text Indexes

0 references

Computing longest previous factor in linear time and applications

0 references

Q4910720

0 references

Dynamic Text Indexing under String Updates

0 references

Multi-method dispatching

0 references

Fast Algorithms for Finding Nearest Common Ancestors

0 references

Q5390005

0 references

On position restricted substring searching in succinct space

0 references

Linear Time Lempel-Ziv Factorization: Simple, Fast, Small

0 references

Range Non-overlapping Indexing and Successive List Indexing

0 references

Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval

0 references

Using persistent data structures for adding range restrictions to searching problems

0 references

Orthogonal Range Searching for Text Indexing

0 references

Position-Restricted Substring Searching

0 references

A Space-Economical Suffix Tree Construction Algorithm

0 references

Sorted Range Reporting

0 references

Application of Lempel-Ziv factorization to the approximation of grammar-based compression.

0 references

On-line construction of suffix trees

0 references

A universal algorithm for sequential data compression

0 references

Identifiers

zbMATH Open document ID

1295.68112

0 references

DOI

10.1016/j.tcs.2013.10.010

0 references

Mathematics Subject Classification ID

0 references

0 references

0 references

0 references

0 references

0 references

Sitelinks

Mathematics(1 entry)

mardi Publication:2437745

@@ Property / describes a project that uses @@
+LZ77
@@ Property / describes a project that uses: LZ77 / rank @@
+Normal rank
@@ Property / MaRDI profile type @@
+MaRDI publication profile
@@ Property / MaRDI profile type: MaRDI publication profile / rank @@
+Normal rank
@@ Property / full work available at URL @@
+https://doi.org/10.1016/j.tcs.2013.10.010
+Normal rank
@@ Property / OpenAlex ID @@
+W2037700329
@@ Property / OpenAlex ID: W2037700329 / rank @@
+Normal rank
@@ Property / cites work @@
+Text Indexing and Dictionary Matching with One Error
+Normal rank
@@ Property / cites work @@
+The level ancestor problem simplified
@@ Property / cites work: The level ancestor problem simplified / rank @@
+Normal rank
@@ Property / cites work @@
+Substring Range Reporting
@@ Property / cites work: Substring Range Reporting / rank @@
+Normal rank
@@ Property / cites work @@
+Orthogonal range searching on the RAM, revisited
@@ Property / cites work: Orthogonal range searching on the RAM, revisited / rank @@
+Normal rank
@@ Property / cites work @@
+Approximating the smallest grammar
@@ Property / cites work: Approximating the smallest grammar / rank @@
+Normal rank
@@ Property / cites work @@
+Q2921685
@@ Property / cites work: Q2921685 / rank @@
+Normal rank
@@ Property / cites work @@
+Comparing DNA Sequence Collections by Direct Comparison of Compressed Text Indexes
+Normal rank
@@ Property / cites work @@
+Computing longest previous factor in linear time and applications
+Normal rank
@@ Property / cites work @@
+Q4910720
@@ Property / cites work: Q4910720 / rank @@
+Normal rank
@@ Property / cites work @@
+Dynamic Text Indexing under String Updates
@@ Property / cites work: Dynamic Text Indexing under String Updates / rank @@
+Normal rank
@@ Property / cites work @@
+Multi-method dispatching
@@ Property / cites work: Multi-method dispatching / rank @@
+Normal rank
@@ Property / cites work @@
+Fast Algorithms for Finding Nearest Common Ancestors
+Normal rank
@@ Property / cites work @@
+Q5390005
@@ Property / cites work: Q5390005 / rank @@
+Normal rank
@@ Property / cites work @@
+On position restricted substring searching in succinct space
+Normal rank
@@ Property / cites work @@
+Linear Time Lempel-Ziv Factorization: Simple, Fast, Small
+Normal rank
@@ Property / cites work @@
+Range Non-overlapping Indexing and Successive List Indexing
+Normal rank
@@ Property / cites work @@
+Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval
+Normal rank
@@ Property / cites work @@
+Using persistent data structures for adding range restrictions to searching problems
+Normal rank
@@ Property / cites work @@
+Orthogonal Range Searching for Text Indexing
@@ Property / cites work: Orthogonal Range Searching for Text Indexing / rank @@
+Normal rank
@@ Property / cites work @@
+Position-Restricted Substring Searching
@@ Property / cites work: Position-Restricted Substring Searching / rank @@
+Normal rank
@@ Property / cites work @@
+A Space-Economical Suffix Tree Construction Algorithm
+Normal rank
@@ Property / cites work @@
+Sorted Range Reporting
@@ Property / cites work: Sorted Range Reporting / rank @@
+Normal rank
@@ Property / cites work @@
+Application of Lempel-Ziv factorization to the approximation of grammar-based compression.
+Normal rank
@@ Property / cites work @@
+On-line construction of suffix trees
@@ Property / cites work: On-line construction of suffix trees / rank @@
+Normal rank
@@ Property / cites work @@
+A universal algorithm for sequential data compression
+Normal rank