Permuted scaled matching (Q294926): Difference between revisions
From MaRDI portal
Property / reviewed by: Travis Gagie
Property / Mathematics Subject Classification ID: 68W32
Property / Mathematics Subject Classification ID: 68R15
Property / zbMATH DE Number: 6594178
Property / zbMATH Keywords: pattern matching
Property / zbMATH Keywords: scaled matching
Property / zbMATH Keywords: jumbled matching
Revision as of 20:38, 27 June 2023
scientific article
Language | Label | Description | Also known as |
---|---|---|---|
English | Permuted scaled matching | scientific article | |
Statements
Permuted scaled matching (English)
16 June 2016
The authors show how, given a text \(T\) of length \(n\) over an alphabet \(\Sigma\) and a multiset \(P\) of \(m\) characters from the same alphabet, we can find, for all positive integers \(k\) simultaneously, all the substrings of \(T\) (called permuted \(k\)-scaled occurrences of \(P\)) that contain exactly \(k\) times as many copies of each distinct character as there are in \(P\). Specifically, they give an \(O (n \log |\Sigma|)\)-time deterministic algorithm and an \(O (n)\)-time Monte Carlo algorithm that return, for each position \(i\), the smallest \(k\) (if one exists) such that \(T [i, i + k m - 1]\) is a permuted \(k\)-scaled occurrence of \(P\). This restricts the size of the possible output to \(O (n)\) without sacrificing any information since, if \(T [i, j]\) and \(T [j + 1, h]\) are permuted scaled occurrences of \(P\), then \(T [i, h]\) is as well. The first main idea behind the authors' algorithms is that \(T [i, j]\) is a permuted scaled occurrence of \(P\) if and only if the following conditions hold: 1) for each character \(\sigma \in \Sigma\), \(\#_\sigma (T [0, i - 1]) \equiv \#_\sigma (T [0, j]) \pmod{\#_\sigma (P)}\), where \(\#_\sigma\) returns the number of copies of \(\sigma\) in its argument; 2) for each consecutive pair of characters \(\sigma_\ell, \sigma_{\ell + 1} \in \Sigma\), \[ \left\lfloor \frac{\#_{\sigma_\ell} (T [0, i - 1])}{\#_{\sigma_\ell} (P)} \right\rfloor - \left\lfloor \frac{\#_{\sigma_{\ell + 1}} (T [0, i - 1])}{\#_{\sigma_{\ell + 1}} (P)} \right\rfloor = \left\lfloor \frac{\#_{\sigma_\ell} (T [0, j])}{\#_{\sigma_\ell} (P)} \right\rfloor - \left\lfloor \frac{\#_{\sigma_{\ell + 1}} (T [0, j])}{\#_{\sigma_{\ell + 1}} (P)} \right\rfloor\;. \] The second main idea is to compute for each position \(i\) in \(T\) a vector of length \(2 |\Sigma|\) whose components are, to start with, \(\#_\sigma (T [0, i - 1]) \bmod \#_\sigma (P)\) for each character \(\sigma \in \Sigma\); next, \(\left\lfloor \frac{\#_{\sigma_\ell} (T [0, i - 1])}{\#_{\sigma_\ell} (P)} \right\rfloor - \left\lfloor \frac{\#_{\sigma_{\ell + 1}} (T [0, i - 1])}{\#_{\sigma_{\ell + 1}} (P)} \right\rfloor\) for each consecutive pair of characters \(\sigma_\ell, \sigma_{\ell + 1} \in \Sigma\); and, finally, the position \(i\) itself. The vectors for positions \(i\) and \(j + 1\) with \(i \le j\) agree on their prefixes of length \(2 |\Sigma| - 1\) if and only if \(T [i, j]\) is a permuted scaled occurrence of \(P\). This immediately yields an \(O (n |\Sigma|)\)-time algorithm based on computing the vectors naïvely and radix-sorting them. The third main idea is to speed up the computation of the vectors using deterministic renaming or randomized hashing and the fact that vectors for consecutive positions differ in \(O (1)\) components. Applying a result by \textit{T. Kociumaka} et al. [J. Comput. Syst. Sci. 84, 205--218 (2017; Zbl 1353.68225)] to group the vectors on the basis of their first \(2 |\Sigma| - 1\) components then yields the deterministic and Monte Carlo time bounds stated above.
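To make the naive \(O (n |\Sigma|)\)-time algorithm described above concrete, here is a minimal Python sketch of it. It is not the authors' \(O (n \log |\Sigma|)\) or \(O (n)\) algorithms; the function name `smallest_scaled_occurrences`, the representation of \(P\) as a string, the assumption that every character of \(T\) also occurs in \(P\), and the use of a hash map in place of radix sorting are illustrative choices rather than anything taken from the paper.

```python
from collections import Counter, defaultdict

def smallest_scaled_occurrences(T, P):
    """For each start position i of T, return the smallest k (or None) such that
    T[i : i + k*len(P)] is a permuted k-scaled occurrence of the multiset P,
    where P is given as a string whose character multiplicities define it."""
    need = Counter(P)                       # #_sigma(P) for each character sigma
    if any(c not in need for c in set(T)):  # simplifying assumption of this sketch
        raise ValueError("this sketch assumes every character of T also occurs in P")
    sigma = sorted(need)                    # fixed ordering sigma_1, ..., sigma_|Sigma|
    m, n = len(P), len(T)

    count = {s: 0 for s in sigma}           # prefix counts #_sigma(T[0 .. p-1])

    def fingerprint():
        # First |Sigma| components: prefix counts modulo #_sigma(P) (condition 1).
        residues = tuple(count[s] % need[s] for s in sigma)
        # Next |Sigma| - 1 components: differences of floored quotients for
        # consecutive characters of Sigma (condition 2).
        q = [count[s] // need[s] for s in sigma]
        diffs = tuple(q[ell] - q[ell + 1] for ell in range(len(sigma) - 1))
        return residues + diffs             # the position itself is kept separately below

    # Group positions by these 2|Sigma| - 1 components; a hash map stands in
    # for the radix sort / deterministic grouping used in the paper.
    groups = defaultdict(list)
    for p in range(n + 1):
        groups[fingerprint()].append(p)     # O(|Sigma|) work per position
        if p < n:
            count[T[p]] += 1

    # T[i .. j] is a permuted scaled occurrence iff positions i and j + 1 share a
    # fingerprint, so the smallest k for start i comes from the next position in
    # i's group; the window length is then automatically a multiple of m.
    result = [None] * n
    for positions in groups.values():       # each list is already in increasing order
        for a, b in zip(positions, positions[1:]):
            result[a] = (b - a) // m
    return result

# Example: P = "ab", T = "aabb" -> [2, 1, None, None], since "aabb" is a
# permuted 2-scaled occurrence and "ab" (starting at position 1) a 1-scaled one.
print(smallest_scaled_occurrences("aabb", "ab"))
```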
pattern matching
scaled matching
jumbled matching