Permuted scaled matching (Q294926)

From MaRDI portal
scientific article
    Statements

    Permuted scaled matching (English)
    16 June 2016
    The authors show how, given a text \(T\) of length \(n\) over an alphabet \(\Sigma\) and a multiset \(P\) of \(m\) characters from the same alphabet, for all positive integers \(k\) simultaneously we can find all the substrings of \(T\) (called permuted \(k\)-scaled occurrences of \(P\)) that contain exactly \(k\) times as many copies of each distinct character as there are in \(P\). Specifically, they give an \(O (n \log |\Sigma|)\)-time deterministic algorithm and an \(O (n)\)-time Monte Carlo algorithm that return, for each distinct position \(i\), the smallest \(k\) (if one exists) such that \(T [i, i + k m - 1]\) is a permuted \(k\)-scaled occurrence of \(P\). This restricts the size of the possible output to \(O (n)\) without sacrificing any information since, if \(T [i, j]\) and \(T [j + 1, h]\) are permuted scaled occurrences of \(P\), then \(T [i, h]\) is as well.

    The first main idea behind the authors' algorithms is that \(T [i, j]\) is a permuted scaled occurrence of \(P\) if and only if the following conditions hold: 1) for each character \(\sigma \in \Sigma\), \(\#_\sigma (T [0, i - 1]) \equiv \#_\sigma (T [0, j]) \bmod \#_\sigma (P)\), where \(\#_\sigma\) returns the number of copies of \(\sigma\) in its argument; 2) for each consecutive pair of characters \(\sigma_\ell, \sigma_{\ell + 1} \in \Sigma\), \[ \left\lfloor \frac{\#_{\sigma_\ell} (T [0, i - 1])}{\#_{\sigma_\ell} (P)} \right\rfloor - \left\lfloor \frac{\#_{\sigma_{\ell + 1}} (T [0, i - 1])}{\#_{\sigma_{\ell + 1}} (P)} \right\rfloor = \left\lfloor \frac{\#_{\sigma_\ell} (T [0, j])}{\#_{\sigma_\ell} (P)} \right\rfloor - \left\lfloor \frac{\#_{\sigma_{\ell + 1}} (T [0, j])}{\#_{\sigma_{\ell + 1}} (P)} \right\rfloor\;. \]

    The second main idea is to compute for each position \(i\) in \(T\) a vector of length \(2 |\Sigma|\) whose components are, to start with, \(\#_\sigma (T [0, i - 1]) \bmod \#_\sigma (P)\) for each character \(\sigma \in \Sigma\); next, \(\left\lfloor \frac{\#_{\sigma_\ell} (T [0, i - 1])}{\#_{\sigma_\ell} (P)} \right\rfloor - \left\lfloor \frac{\#_{\sigma_{\ell + 1}} (T [0, i - 1])}{\#_{\sigma_{\ell + 1}} (P)} \right\rfloor\) for each consecutive pair of characters \(\sigma_\ell, \sigma_{\ell + 1} \in \Sigma\); and, finally, \(i\) itself. The vectors for positions \(i\) and \(j\) with \(i < j\) agree on their prefixes of length \(2 |\Sigma| - 1\) if and only if \(T [i, j - 1]\) is a permuted scaled occurrence of \(P\). This immediately yields an \(O (n |\Sigma|)\)-time algorithm based on computing the vectors naïvely and radix-sorting them. The third main idea is to speed up the computation of the vectors using deterministic renaming or randomized hashing, exploiting the fact that the vectors for consecutive positions differ in \(O (1)\) components. Applying a result by \textit{T. Kociumaka} et al. [J. Comput. Syst. Sci. 84, 205--218 (2017; Zbl 1353.68225)] to group the vectors by their first \(2 |\Sigma| - 1\) components then yields the deterministic and Monte Carlo time bounds stated above.
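    The naïve \(O (n |\Sigma|)\)-time version of this scheme can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: it simplifies by assuming every character of \(T\) also occurs in \(P\), uses a hash map in place of radix sorting to group equal fingerprints, and the function and variable names are my own.

    ```python
    from collections import defaultdict

    def smallest_scaled_occurrences(T, P):
        """For each position i of T, return the smallest k (when one exists)
        such that T[i : i + k*len(P)] contains exactly k copies of every
        character of the multiset P, using the prefix-count fingerprints
        described in the review (naive O(n*|Sigma|) variant)."""
        sigma = sorted(set(P))
        # simplifying assumption: the alphabet of T is contained in that of P
        assert set(T) <= set(sigma)
        m = len(P)
        cP = {c: P.count(c) for c in sigma}      # #_sigma(P) for each character

        counts = dict.fromkeys(sigma, 0)          # #_sigma of the current prefix

        def fingerprint():
            # residues of the prefix counts modulo the pattern counts (condition 1) ...
            mods = tuple(counts[c] % cP[c] for c in sigma)
            # ... plus floor-quotient differences of consecutive characters (condition 2)
            diffs = tuple(counts[sigma[l]] // cP[sigma[l]]
                          - counts[sigma[l + 1]] // cP[sigma[l + 1]]
                          for l in range(len(sigma) - 1))
            return mods + diffs

        groups = defaultdict(list)                # fingerprint -> positions, in order
        groups[fingerprint()].append(0)
        for i, c in enumerate(T):
            counts[c] += 1
            groups[fingerprint()].append(i + 1)

        result = {}
        for positions in groups.values():
            # consecutive equal-fingerprint positions i < j yield the occurrence
            # T[i : j]; the nearest such j gives the smallest k for position i
            for i, j in zip(positions, positions[1:]):
                result[i] = (j - i) // m
        return result
    ```

    For example, `smallest_scaled_occurrences("aabbaabb", "ab")` reports \(k = 2\) at position 0 (for `"aabb"`) and \(k = 1\) at position 1 (for `"ab"`). Grouping via hashing already hints at the Monte Carlo variant; the deterministic speedups (renaming, and the Kociumaka et al. grouping result) replace this hash map.
    
    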
    pattern matching
    scaled matching
    jumbled matching