Generalized substring compression (Q2437745)

For the substring-compression problem, we are asked to preprocess a string $S [1..n]$ such that later, given $i$ and $j$ with $1 \leq i \leq j \leq n$, we can quickly return the output of some pre-specified compression algorithm on $S [i..j]$; in this paper the algorithm is LZ77 [\textit{J. Ziv} and \textit{A. Lempel}, IEEE Trans. Inf. Theory 23, 337--343 (1977; Zbl 0379.94010)]. For the generalized substring-compression problem, we are asked to preprocess $S$ such that later, given $i$, $j$, $\alpha$ and $\beta$ with $1 \leq \alpha \leq \beta \leq n$, we can quickly return the parse of $S [i..j]$ that LZ77 would generate when already given $S [\alpha..\beta]$ as a context. That is, we return the suffix of the output of LZ77 on $S [\alpha..\beta] \$ S [i..j]$, where \$ is a special character not occurring in $S$, that is generated while LZ77 is processing $S [i..j]$. \textit{G. Cormode} and \textit{S. Muthukrishnan} [``Substring compression problems'', in: Proceedings of the sixteenth annual ACM-SIAM symposium on discrete algorithms, SODA 2005. New York, NY: Association for Computing Machinery (2005)] introduced and studied these problems. Via a reduction to range reporting, they gave an $O (n \log^\epsilon n)$-space data structure for substring compression with $O (C (i, j) \log n \log \log n)$ query time, where $C (i, j)$ is the number of phrases in the LZ77 parse of $S [i..j]$. They also gave a data structure for generalized substring compression, but it was faulty. In this paper, the authors improve Cormode and Muthukrishnan's result for substring compression, via a reduction to the range-successor problem [\textit{H.-P. Lenhof} and \textit{M. Smid}, RAIRO, Inform. Théor. Appl. 28, No. 1, 25--49 (1994; Zbl 0998.68520)], and give the first correct data structure for generalized substring compression, via a reduction to range emptiness. They achieve various time-space tradeoffs by choosing the data structures for range successor and range emptiness appropriately. For example, they obtain linear-space data structures with $O (C (i, j) \log^\epsilon n)$ and $O \left( C_{\alpha, \beta} (i, j) \log \left( \frac{j - i}{C_{\alpha, \beta} (i, j)} \right) \log^\epsilon n \right)$ query times, respectively, for substring compression and generalized substring compression. Suppose we have already encoded $S [i..k - 1]$ and now we want to find the next phrase in the LZ77 parse of $S [i..j]$. For substring compression, we should find the length of the longest common prefix (LCP) of $S [k..j]$ and any of $S [i..n], \dots, S [k - 1..n]$. To be able to do this quickly, we represent the suffixes as points on a grid: if $S [y..n]$ is lexicographically $x$th among the suffixes of $S$, then we represent it as the point $(x, y)$. Finding the length of the LCP of $S [k..j]$ and any of $S [i..n], \dots, S [k - 1..n]$ that are lexicographically less (or greater) than $S [k..j]$, is equivalent to finding the rightmost (or leftmost) point whose $y$-coordinate is between $i$ and $k - 1$ and whose $x$-coordinate is less (or greater) than that of the point with $y$-coordinate $k$. For generalized substring compression, we should find the length of the LCP of $S [k..j]$ and any of $S [\alpha..\beta], S [\alpha + 1..\beta], \dots, S [\beta], S [i..n], \dots, S [k - 1..n]$. Let $x_{\min}$ and $x_{\max}$ be the lexicographic ranks of the lexicographically smallest and largest suffixes of $S$ that share prefixes of length at least $\ell \leq j - k + 1$ with $S [k..j]$; these can be computed quickly using, e.g., an augmented suffix tree. Determining whether the LCP of $S [k..j]$ and $S [\alpha..\beta], S [\alpha + 1..\beta], \dots, S [\beta]$ is at least $\ell$ is equivalent to determining whether there is a point in the rectangle $[x_{\min}, x_{\max}] \times [\alpha, \beta]$. Thus, generalized substring compression can be solved using a search with a range-emptiness query at each step.

0 references

reviewed by

Travis Gagie

0 references

zbMATH Keywords

data compression

0 references

Lempel-Ziv compression

0 references

suffix tree

0 references

range searching