Generalized substring compression (Q2437745): Difference between revisions

From MaRDI portal
Set OpenAlex properties.
ReferenceBot (talk | contribs)
Changed an Item
 
Property / cites work
 
Property / cites work: Text Indexing and Dictionary Matching with One Error / rank
 
Normal rank
Property / cites work
 
Property / cites work: The level ancestor problem simplified / rank
 
Normal rank
Property / cites work
 
Property / cites work: Substring Range Reporting / rank
 
Normal rank
Property / cites work
 
Property / cites work: Orthogonal range searching on the RAM, revisited / rank
 
Normal rank
Property / cites work
 
Property / cites work: Approximating the smallest grammar / rank
 
Normal rank
Property / cites work
 
Property / cites work: Q2921685 / rank
 
Normal rank
Property / cites work
 
Property / cites work: Comparing DNA Sequence Collections by Direct Comparison of Compressed Text Indexes / rank
 
Normal rank
Property / cites work
 
Property / cites work: Computing longest previous factor in linear time and applications / rank
 
Normal rank
Property / cites work
 
Property / cites work: Q4910720 / rank
 
Normal rank
Property / cites work
 
Property / cites work: Dynamic Text Indexing under String Updates / rank
 
Normal rank
Property / cites work
 
Property / cites work: Multi-method dispatching / rank
 
Normal rank
Property / cites work
 
Property / cites work: Fast Algorithms for Finding Nearest Common Ancestors / rank
 
Normal rank
Property / cites work
 
Property / cites work: Q5390005 / rank
 
Normal rank
Property / cites work
 
Property / cites work: On position restricted substring searching in succinct space / rank
 
Normal rank
Property / cites work
 
Property / cites work: Linear Time Lempel-Ziv Factorization: Simple, Fast, Small / rank
 
Normal rank
Property / cites work
 
Property / cites work: Range Non-overlapping Indexing and Successive List Indexing / rank
 
Normal rank
Property / cites work
 
Property / cites work: Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval / rank
 
Normal rank
Property / cites work
 
Property / cites work: Using persistent data structures for adding range restrictions to searching problems / rank
 
Normal rank
Property / cites work
 
Property / cites work: Orthogonal Range Searching for Text Indexing / rank
 
Normal rank
Property / cites work
 
Property / cites work: Position-Restricted Substring Searching / rank
 
Normal rank
Property / cites work
 
Property / cites work: A Space-Economical Suffix Tree Construction Algorithm / rank
 
Normal rank
Property / cites work
 
Property / cites work: Sorted Range Reporting / rank
 
Normal rank
Property / cites work
 
Property / cites work: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. / rank
 
Normal rank
Property / cites work
 
Property / cites work: On-line construction of suffix trees / rank
 
Normal rank
Property / cites work
 
Property / cites work: A universal algorithm for sequential data compression / rank
 
Normal rank

Latest revision as of 10:49, 7 July 2024

scientific article
Language Label Description Also known as
English
Generalized substring compression
scientific article

    Statements

    Generalized substring compression (English)
    0 references
    0 references
    0 references
    0 references
    0 references
    13 March 2014
    0 references
    For the substring-compression problem, we are asked to preprocess a string \(S [1..n]\) such that later, given \(i\) and \(j\) with \(1 \leq i \leq j \leq n\), we can quickly return the output of some pre-specified compression algorithm on \(S [i..j]\); in this paper the algorithm is LZ77 [\textit{J. Ziv} and \textit{A. Lempel}, IEEE Trans. Inf. Theory 23, 337--343 (1977; Zbl 0379.94010)]. For the generalized substring-compression problem, we are asked to preprocess \(S\) such that later, given \(i\), \(j\), \(\alpha\) and \(\beta\) with \(1 \leq \alpha \leq \beta \leq n\), we can quickly return the parse of \(S [i..j]\) that LZ77 would generate when already given \(S [\alpha..\beta]\) as a context. That is, we return the suffix of the output of LZ77 on \(S [\alpha..\beta] \$ S [i..j]\), where \$ is a special character not occurring in \(S\), that is generated while LZ77 is processing \(S [i..j]\). \textit{G. Cormode} and \textit{S. Muthukrishnan} [``Substring compression problems'', in: Proceedings of the sixteenth annual ACM-SIAM symposium on discrete algorithms, SODA 2005. New York, NY: Association for Computing Machinery (2005)] introduced and studied these problems. Via a reduction to range reporting, they gave an \(O (n \log^\epsilon n)\)-space data structure for substring compression with \(O (C (i, j) \log n \log \log n)\) query time, where \(C (i, j)\) is the number of phrases in the LZ77 parse of \(S [i..j]\). They also gave a data structure for generalized substring compression, but it was faulty. In this paper, the authors improve Cormode and Muthukrishnan's result for substring compression, via a reduction to the range-successor problem [\textit{H.-P. Lenhof} and \textit{M. Smid}, RAIRO, Inform. Théor. Appl. 28, No. 1, 25--49 (1994; Zbl 0998.68520)], and give the first correct data structure for generalized substring compression, via a reduction to range emptiness. They achieve various time-space tradeoffs by choosing the data structures for range successor and range emptiness appropriately. For example, they obtain linear-space data structures with \(O (C (i, j) \log^\epsilon n)\) and \(O \left( C_{\alpha, \beta} (i, j) \log \left( \frac{j - i}{C_{\alpha, \beta} (i, j)} \right) \log^\epsilon n \right)\) query times, respectively, for substring compression and generalized substring compression. Suppose we have already encoded \(S [i..k - 1]\) and now we want to find the next phrase in the LZ77 parse of \(S [i..j]\). For substring compression, we should find the length of the longest common prefix (LCP) of \(S [k..j]\) and any of \(S [i..n], \dots, S [k - 1..n]\). To be able to do this quickly, we represent the suffixes as points on a grid: if \(S [y..n]\) is lexicographically \(x\)th among the suffixes of \(S\), then we represent it as the point \((x, y)\). Finding the length of the LCP of \(S [k..j]\) and any of \(S [i..n], \dots, S [k - 1..n]\) that are lexicographically less (or greater) than \(S [k..j]\), is equivalent to finding the rightmost (or leftmost) point whose \(y\)-coordinate is between \(i\) and \(k - 1\) and whose \(x\)-coordinate is less (or greater) than that of the point with \(y\)-coordinate \(k\). For generalized substring compression, we should find the length of the LCP of \(S [k..j]\) and any of \(S [\alpha..\beta], S [\alpha + 1..\beta], \dots, S [\beta], S [i..n], \dots, S [k - 1..n]\). Let \(x_{\min}\) and \(x_{\max}\) be the lexicographic ranks of the lexicographically smallest and largest suffixes of \(S\) that share prefixes of length at least \(\ell \leq j - k + 1\) with \(S [k..j]\); these can be computed quickly using, e.g., an augmented suffix tree. Determining whether the LCP of \(S [k..j]\) and \(S [\alpha..\beta], S [\alpha + 1..\beta], \dots, S [\beta]\) is at least \(\ell\) is equivalent to determining whether there is a point in the rectangle \([x_{\min}, x_{\max}] \times [\alpha, \beta]\). Thus, generalized substring compression can be solved using a search with a range-emptiness query at each step.
    0 references
    0 references
    data compression
    0 references
    Lempel-Ziv compression
    0 references
    suffix tree
    0 references
    range searching
    0 references
    0 references
    0 references
    0 references
    0 references
    0 references
    0 references

    Identifiers

    0 references
    0 references
    0 references
    0 references
    0 references
    0 references