Speeding up q-gram mining on grammar-based compressed texts

Abstract: We present an efficient algorithm for calculating \(q\)-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP \(\mathcal{T}\) of size \(n\) that represents string \(T\), the algorithm computes the occurrence frequencies of all \(q\)-grams in \(T\), by reducing the problem to the weighted \(q\)-gram frequencies problem on a trie-like structure of size \(m = |T| - \mathit{dup}(q, \mathcal{T})\), where \(\mathit{dup}(q, \mathcal{T})\) is a quantity that represents the amount of redundancy that the SLP captures with respect to \(q\)-grams. The reduced problem can be solved in linear time. Since \(m = O(qn)\), the running time of our algorithm is \(O(\min\{|T| - \mathit{dup}(q, \mathcal{T}),\, qn\})\), improving our previous \(O(qn)\) algorithm when \(q = \Omega(|T| / n)\).
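To make the setting concrete, the following is a minimal sketch of weighted \(q\)-gram counting on an SLP without decompressing it. It is an illustration only and is not the trie-based algorithm of the paper. It assumes a hypothetical encoding in which the SLP is a Python list of rules, each rule either a single-character string (terminal) or a pair of indices of earlier rules, with the last rule as the start symbol. The function names `qgram_freqs_slp`, `expand`, and `naive_qgram_freqs` are likewise made up for this example. The idea is that every occurrence of a \(q\)-gram with \(q \ge 2\) crosses the boundary of exactly one node of the derivation tree, so it suffices to scan, for each rule, the last \(q-1\) characters of the left child followed by the first \(q-1\) characters of the right child, weighting each \(q\)-gram found by the number of times that rule occurs in the derivation tree.

```python
from collections import Counter


def qgram_freqs_slp(rules, q):
    """Count q-grams of the string derived by an SLP without expanding it.

    Assumed encoding (hypothetical): rules[i] is either a one-character
    string or a pair (left, right) of indices smaller than i; the last
    rule is the start symbol.
    """
    k = q - 1
    n = len(rules)
    pre = [""] * n  # prefix of each rule's expansion, length at most q-1
    suf = [""] * n  # suffix of each rule's expansion, length at most q-1
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            s_pre = s_suf = rule
        else:
            left, right = rule
            s_pre = pre[left] + pre[right]
            s_suf = suf[left] + suf[right]
        pre[i] = s_pre[:k]
        suf[i] = s_suf[max(0, len(s_suf) - k):]

    # vocc[i]: number of occurrences of rule i in the derivation tree.
    vocc = [0] * n
    vocc[-1] = 1
    for i in reversed(range(n)):
        if not isinstance(rules[i], str):
            left, right = rules[i]
            vocc[left] += vocc[i]
            vocc[right] += vocc[i]

    counts = Counter()
    for i, rule in enumerate(rules):
        if vocc[i] == 0:
            continue  # rule is unreachable from the start symbol
        if isinstance(rule, str):
            if q == 1:
                counts[rule] += vocc[i]
            continue
        left, right = rule
        # Every length-q window of t crosses the left/right boundary,
        # because each half has at most q-1 characters.
        t = suf[left] + pre[right]
        for j in range(len(t) - q + 1):
            counts[t[j:j + q]] += vocc[i]
    return counts


def expand(rules):
    """Decompress the SLP (used only as a reference for checking)."""
    text = []
    for rule in rules:
        text.append(rule if isinstance(rule, str) else text[rule[0]] + text[rule[1]])
    return text[-1]


def naive_qgram_freqs(text, q):
    """Count q-grams directly on the uncompressed string."""
    return Counter(text[j:j + q] for j in range(len(text) - q + 1))


# Example: an SLP with 5 rules deriving "abababab".
rules = ["a", "b", (0, 1), (2, 2), (3, 3)]
print(qgram_freqs_slp(rules, 2))
```

In this sketch each rule contributes a scan over at most \(2(q-1)\) characters, so the number of windows examined grows with \(qn\) rather than with \(|T|\). The same \(q\)-gram may still be scanned at many different rules, which is the kind of redundancy that the quantity \(\mathit{dup}(q, \mathcal{T})\) in the abstract accounts for.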

