Generalized substring compression (Q2437745)

From MaRDI portal
scientific article
Language Label Description Also known as
English
Generalized substring compression
scientific article

    Statements

    Generalized substring compression (English)
    0 references
    0 references
    0 references
    0 references
    0 references
    13 March 2014
    0 references
    For the substring-compression problem, we are asked to preprocess a string \(S [1..n]\) such that later, given \(i\) and \(j\) with \(1 \leq i \leq j \leq n\), we can quickly return the output of some pre-specified compression algorithm on \(S [i..j]\); in this paper the algorithm is LZ77 [\textit{J. Ziv} and \textit{A. Lempel}, IEEE Trans. Inf. Theory 23, 337--343 (1977; Zbl 0379.94010)]. For the generalized substring-compression problem, we are asked to preprocess \(S\) such that later, given \(i\), \(j\), \(\alpha\) and \(\beta\) with \(1 \leq \alpha \leq \beta \leq n\), we can quickly return the parse of \(S [i..j]\) that LZ77 would generate when already given \(S [\alpha..\beta]\) as a context. That is, we return the suffix of the output of LZ77 on \(S [\alpha..\beta] \$ S [i..j]\), where \$ is a special character not occurring in \(S\), that is generated while LZ77 is processing \(S [i..j]\). \textit{G. Cormode} and \textit{S. Muthukrishnan} [``Substring compression problems'', in: Proceedings of the sixteenth annual ACM-SIAM symposium on discrete algorithms, SODA 2005. New York, NY: Association for Computing Machinery (2005)] introduced and studied these problems. Via a reduction to range reporting, they gave an \(O (n \log^\epsilon n)\)-space data structure for substring compression with \(O (C (i, j) \log n \log \log n)\) query time, where \(C (i, j)\) is the number of phrases in the LZ77 parse of \(S [i..j]\). They also gave a data structure for generalized substring compression, but it was faulty. In this paper, the authors improve Cormode and Muthukrishnan's result for substring compression, via a reduction to the range-successor problem [\textit{H.-P. Lenhof} and \textit{M. Smid}, RAIRO, Inform. Théor. Appl. 28, No. 1, 25--49 (1994; Zbl 0998.68520)], and give the first correct data structure for generalized substring compression, via a reduction to range emptiness. They achieve various time-space tradeoffs by choosing the data structures for range successor and range emptiness appropriately. For example, they obtain linear-space data structures with \(O (C (i, j) \log^\epsilon n)\) and \(O \left( C_{\alpha, \beta} (i, j) \log \left( \frac{j - i}{C_{\alpha, \beta} (i, j)} \right) \log^\epsilon n \right)\) query times, respectively, for substring compression and generalized substring compression. Suppose we have already encoded \(S [i..k - 1]\) and now we want to find the next phrase in the LZ77 parse of \(S [i..j]\). For substring compression, we should find the length of the longest common prefix (LCP) of \(S [k..j]\) and any of \(S [i..n], \dots, S [k - 1..n]\). To be able to do this quickly, we represent the suffixes as points on a grid: if \(S [y..n]\) is lexicographically \(x\)th among the suffixes of \(S\), then we represent it as the point \((x, y)\). Finding the length of the LCP of \(S [k..j]\) and any of \(S [i..n], \dots, S [k - 1..n]\) that are lexicographically less (or greater) than \(S [k..j]\), is equivalent to finding the rightmost (or leftmost) point whose \(y\)-coordinate is between \(i\) and \(k - 1\) and whose \(x\)-coordinate is less (or greater) than that of the point with \(y\)-coordinate \(k\). For generalized substring compression, we should find the length of the LCP of \(S [k..j]\) and any of \(S [\alpha..\beta], S [\alpha + 1..\beta], \dots, S [\beta], S [i..n], \dots, S [k - 1..n]\). Let \(x_{\min}\) and \(x_{\max}\) be the lexicographic ranks of the lexicographically smallest and largest suffixes of \(S\) that share prefixes of length at least \(\ell \leq j - k + 1\) with \(S [k..j]\); these can be computed quickly using, e.g., an augmented suffix tree. Determining whether the LCP of \(S [k..j]\) and \(S [\alpha..\beta], S [\alpha + 1..\beta], \dots, S [\beta]\) is at least \(\ell\) is equivalent to determining whether there is a point in the rectangle \([x_{\min}, x_{\max}] \times [\alpha, \beta]\). Thus, generalized substring compression can be solved using a search with a range-emptiness query at each step.
    0 references
    0 references
    0 references
    0 references
    0 references
    0 references
    0 references
    data compression
    0 references
    Lempel-Ziv compression
    0 references
    suffix tree
    0 references
    range searching
    0 references
    0 references
    0 references
    0 references
    0 references
    0 references
    0 references
    0 references