Lempel-Ziv compressed structures for document retrieval (Q2272976)

From MaRDI portal
scientific article
Language Label Description Also known as
English
Lempel-Ziv compressed structures for document retrieval
scientific article

    Statements

    Lempel-Ziv compressed structures for document retrieval (English)
    0 references
    0 references
    0 references
    17 September 2019
    0 references
    A lot of research was done in the field of reducing the space of document retrieval indexes. It is mostly focused on the reduction of the number of required bits per character (bpc). This is around 17--21 bpc, but some representations use only 7 bpc -- if so, there is some additional cost of slowing down query times per retrieved document (at best \(O(\lg^{1+\epsilon} n)\), where \(n\) is the collection size and \(\epsilon > 0\) is a small constant). Currently, there are several reduced-space solutions built on compressed suffix arrays, where the compression is statistical. As an alternative there are the LZ-indexes (Lempel-Ziv 1978, LZ78 compression method). They are faster and have also other advantages. LZ indexes are a novel efficient approach to document search and retrieval. This is related to the fact that LZ78 parses the text into \(n'\) phrases: at most there are \(n\lg_\sigma n\) (\(\sigma\) the size of the alphabet) phrases and in practice this is 1/20--1/6 of \(n\). The authors were able to use the fast and large structures on the sequences of \(n'\) phrases, not of \(n\) symbols, and reduce their size by an order of magnitude, with the same speed. The proposed indexes use \((3-5)nH_l + O(n)\) bits, where \(H_l\) is the \(l\)th-order entropy and in practice this gives 7--10 bpc. The worst case required to retrieve the answer is up to \(O(m \lg^2 n)\) time (\(m\) is the pattern length), and \(O(\lg^3 n)\) for top-\(k\) retrieval. It turned out that each answer was returned in 10--100 microseconds on their computer machine. During the experiments it turned out that the first 75\%--80\% of the answers were retrieved in \(O(1)\) time. The top-\(k\) retrieval index returns approximate answers using \(2nH_l + O(n)\) bits (4--6 bpc). The authors were able to show that their proposed solutions are faster than existing ones needing very short times. The whole paper is divided into 7 sections. Section 2 shows the theoretical background, whereas the next one gives a description of the original LZ78-based pattern-matching index. Section 4 presents the main results obtained, with expansion in Section 5 to a new solution for document listing, and some test in Section 6 related to important details for approximate top-\(k\) document retrieval. The paper ends with a conclusion in Section 7.
    0 references
    0 references
    0 references
    0 references
    0 references
    0 references
    document retrieval
    0 references
    document listing
    0 references
    top-\(k\) queries
    0 references
    string databases
    0 references
    compressed data structures
    0 references
    0 references
    0 references
    0 references
    0 references
    0 references
    0 references
    0 references
    0 references