Lempel-Ziv compressed structures for document retrieval (Q2272976)

From MaRDI portal

scientific article

Language	Label	Description	Also known as
English	Lempel-Ziv compressed structures for document retrieval	scientific article

Statements

scholarly article

Lempel-Ziv compressed structures for document retrieval (English)

Héctor Ferrada

Gonzalo Navarro

Information and Computation

publication date

17 September 2019

Much research has addressed reducing the space of document retrieval indexes, mostly by lowering the number of bits per character (bpc) they require. Typical indexes use around 17--21 bpc; some representations use only 7 bpc, but at the cost of slower query times per retrieved document (at best \(O(\lg^{1+\epsilon} n)\), where \(n\) is the collection size and \(\epsilon > 0\) is a small constant). Currently, there are several reduced-space solutions built on compressed suffix arrays, where the compression is statistical. LZ-indexes, based on the Lempel-Ziv 1978 (LZ78) compression method, are an alternative: they are faster and have other advantages as well, which makes them a novel and efficient approach to document search and retrieval. The key fact is that LZ78 parses the text into \(n'\) phrases: there are \(O(n/\lg_\sigma n)\) phrases (\(\sigma\) is the size of the alphabet), and in practice \(n'\) is 1/20--1/6 of \(n\). The authors were able to build the fast but large structures over the sequence of \(n'\) phrases rather than of \(n\) symbols, reducing their size by an order of magnitude at the same speed. The proposed indexes use between \(3nH_l + O(n)\) and \(5nH_l + O(n)\) bits, where \(H_l\) is the \(l\)th-order entropy; in practice this gives 7--10 bpc. In the worst case, retrieving an answer takes up to \(O(m \lg^2 n)\) time (\(m\) is the pattern length), and \(O(\lg^3 n)\) for top-\(k\) retrieval. In the experiments, each answer was returned in 10--100 microseconds on the authors' machine, and the first 75\%--80\% of the answers were retrieved in \(O(1)\) time. The top-\(k\) retrieval index returns approximate answers using \(2nH_l + O(n)\) bits (4--6 bpc). The authors show that the proposed solutions are faster than existing ones and need very short query times.
The paper is divided into 7 sections. Section 2 gives the theoretical background, and Section 3 describes the original LZ78-based pattern-matching index. Section 4 presents the main results, which Section 5 extends to a new solution for document listing; Section 6 reports some tests concerning important details of approximate top-\(k\) document retrieval. The paper ends with a conclusion in Section 7.
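
The LZ78 parsing that the review refers to can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `lz78_parse` and `lz78_decode` and the dictionary-based trie are assumptions made here for illustration only. It shows how a text of \(n\) symbols is cut into \(n'\) phrases, each formed by a previous phrase plus one new character.

```python
def lz78_parse(text):
    """Cut `text` into LZ78 phrases, returned as (previous_phrase_id, char) pairs.

    Phrase id 0 denotes the empty phrase; the i-th emitted phrase gets id i (1-based).
    """
    trie = {}      # (parent phrase id, char) -> phrase id; a dict stands in for the LZ trie
    phrases = []
    node = 0       # id of the longest known phrase matched so far
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep extending the current match
        else:
            phrases.append((node, ch))        # new phrase = known phrase + one char
            trie[(node, ch)] = len(phrases)
            node = 0
    if node != 0:
        # The text ended inside an already known phrase; emit it with no new char.
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Rebuild the original text from the (previous_phrase_id, char) pairs."""
    strings = [""]                            # strings[0] is the empty phrase
    for parent, ch in phrases:
        strings.append(strings[parent] + ch)
    return "".join(strings[1:])


if __name__ == "__main__":
    text = "abababab"
    parsed = lz78_parse(text)
    print(len(text), "symbols ->", len(parsed), "phrases:", parsed)
```

On repetitive input the number of phrases grows much more slowly than the text length, which is the property that lets structures built over phrases be smaller than structures built over symbols.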

Dominik Strzałka

zbMATH Keywords

document retrieval

document listing

top-\(k\) queries

string databases

compressed data structures

MaRDI profile type

MaRDI publication profile

full work available at URL

https://doi.org/10.1016/j.ic.2019.01.006

Suffix Arrays: A New Method for On-Line String Searches

Space-Efficient Frameworks for Top-\(k\) String Retrieval

Spaces, Trees, and Colors

Top-k Ranked Document Search in General Text Databases

Improved Single-Term Top-\(k\) Document Retrieval

Succinct data structures for flexible text retrieval systems

General Document Retrieval in Compact Space

Improved compressed indexes for full-text document retrieval

Compressed representations of sequences and full-text indexes

An analysis of the Burrows--Wheeler transform

Indexing compressed text

Indexing text using the Ziv--Lempel trie

Implementing the LZ-index

Stronger Lempel-Ziv based compressed text indexing

Compression of individual sequences via variable-rate coding

Succinct indexable dictionaries with applications to encoding \(k\)-ary trees, prefix sums and multisets

Succinct Representation of Balanced Parentheses and Static Trees

Representing trees of higher degree

Fully Functional Static and Dynamic Succinct Trees

Succinct Trees in Practice

Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

New text indexing functionalities of the compressed suffix arrays

Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays

Compression of Low Entropy Strings with Lempel--Ziv Algorithms

Space-Efficient Algorithms for Document Retrieval

New algorithms on wavelet trees and applications to information retrieval

Space-Efficient Framework for Top-k String Retrieval Problems

Space-efficient data-analysis queries on grids

Improved range minimum queries

On the height of digital trees and related problems

On compressing and indexing repetitive sequences

Identifiers

zbMATH Open document ID

Mathematics Subject Classification ID

zbMATH DE Number

10.1016/J.IC.2019.01.006

Sitelinks

Mathematics(1 entry)

mardi Publication:2272976
