Permuted scaled matching (Q294926): Difference between revisions
From MaRDI portal
Property / reviewed by: Travis Gagie
Property / Mathematics Subject Classification ID: 68W32
Property / Mathematics Subject Classification ID: 68R15
Property / zbMATH DE Number: 6594178
Property / zbMATH Keywords: pattern matching
Property / zbMATH Keywords: scaled matching
Property / zbMATH Keywords: jumbled matching
Revision as of 20:38, 27 June 2023
scientific article
Language | Label | Description | Also known as |
---|---|---|---|
English | Permuted scaled matching | scientific article | |
Statements
Permuted scaled matching (English)
16 June 2016
The authors show how, given a text \(T\) of length \(n\) over an alphabet \(\Sigma\) and a multiset \(P\) of \(m\) characters from the same alphabet, we can find, for all positive integers \(k\) simultaneously, all the substrings of \(T\) (called permuted \(k\)-scaled occurrences of \(P\)) that contain exactly \(k\) times as many copies of each distinct character as there are in \(P\). Specifically, they give an \(O (n \log |\Sigma|)\)-time deterministic algorithm and an \(O (n)\)-time Monte Carlo algorithm that return, for each position \(i\), the smallest \(k\) (if one exists) such that \(T [i, i + k m - 1]\) is a permuted \(k\)-scaled occurrence of \(P\). This restricts the size of the possible output to \(O (n)\) without sacrificing any information since, if \(T [i, j]\) and \(T [j + 1, h]\) are permuted scaled occurrences of \(P\), then \(T [i, h]\) is as well. The first main idea behind the authors' algorithms is that \(T [i, j]\) is a permuted scaled occurrence of \(P\) if and only if the following conditions hold: 1) for each character \(\sigma \in \Sigma\), \(\#_\sigma (T [0, i - 1]) \equiv \#_\sigma (T [0, j]) \pmod{\#_\sigma (P)}\), where \(\#_\sigma\) returns the number of copies of \(\sigma\) in its argument; 2) for each consecutive pair of characters \(\sigma_\ell, \sigma_{\ell + 1} \in \Sigma\), \[ \left\lfloor \frac{\#_{\sigma_\ell} (T [0, i - 1])}{\#_{\sigma_\ell} (P)} \right\rfloor - \left\lfloor \frac{\#_{\sigma_{\ell + 1}} (T [0, i - 1])}{\#_{\sigma_{\ell + 1}} (P)} \right\rfloor = \left\lfloor \frac{\#_{\sigma_\ell} (T [0, j])}{\#_{\sigma_\ell} (P)} \right\rfloor - \left\lfloor \frac{\#_{\sigma_{\ell + 1}} (T [0, j])}{\#_{\sigma_{\ell + 1}} (P)} \right\rfloor\;. \] The second main idea is to compute for each position \(i\) in \(T\) a vector of length \(2 |\Sigma|\) whose components are, to start with, \(\#_\sigma (T [0, i - 1]) \bmod \#_\sigma (P)\) for each character \(\sigma \in \Sigma\); next, \(\left\lfloor \frac{\#_{\sigma_\ell} (T [0, i - 1])}{\#_{\sigma_\ell} (P)} \right\rfloor - \left\lfloor \frac{\#_{\sigma_{\ell + 1}} (T [0, i - 1])}{\#_{\sigma_{\ell + 1}} (P)} \right\rfloor\) for each consecutive pair of characters \(\sigma_\ell, \sigma_{\ell + 1} \in \Sigma\); and, finally, the position \(i\) itself. The vectors for positions \(i\) and \(j + 1\) with \(i \le j\) agree on their prefixes of length \(2 |\Sigma| - 1\) if and only if \(T [i, j]\) is a permuted scaled occurrence of \(P\). This immediately yields an \(O (n |\Sigma|)\)-time algorithm based on computing the vectors naïvely and radix-sorting them. The third main idea is to speed up the computation of the vectors using deterministic renaming or randomized hashing and the fact that vectors for consecutive positions differ in \(O (1)\) components. Applying a result by \textit{T. Kociumaka} et al. [J. Comput. Syst. Sci. 84, 205--218 (2017; Zbl 1353.68225)] to group the vectors on the basis of their first \(2 |\Sigma| - 1\) components then yields the deterministic and Monte Carlo time bounds stated above.
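To make the naive \(O (n |\Sigma|)\)-time algorithm described above concrete, here is a minimal Python sketch of it. It is not the authors' \(O (n \log |\Sigma|)\) or \(O (n)\) algorithms; the function name `smallest_scaled_occurrences`, the representation of \(P\) as a string, the assumption that every character of \(T\) also occurs in \(P\), and the use of a hash map in place of radix sorting are illustrative choices rather than anything taken from the paper.

```python
from collections import Counter, defaultdict

def smallest_scaled_occurrences(T, P):
    """For each start position i of T, return the smallest k (or None) such that
    T[i : i + k*len(P)] is a permuted k-scaled occurrence of the multiset P,
    where P is given as a string whose character multiplicities define it."""
    need = Counter(P)                       # #_sigma(P) for each character sigma
    if any(c not in need for c in set(T)):  # simplifying assumption of this sketch
        raise ValueError("this sketch assumes every character of T also occurs in P")
    sigma = sorted(need)                    # fixed ordering sigma_1, ..., sigma_|Sigma|
    m, n = len(P), len(T)

    count = {s: 0 for s in sigma}           # prefix counts #_sigma(T[0 .. p-1])

    def fingerprint():
        # First |Sigma| components: prefix counts modulo #_sigma(P) (condition 1).
        residues = tuple(count[s] % need[s] for s in sigma)
        # Next |Sigma| - 1 components: differences of floored quotients for
        # consecutive characters of Sigma (condition 2).
        q = [count[s] // need[s] for s in sigma]
        diffs = tuple(q[ell] - q[ell + 1] for ell in range(len(sigma) - 1))
        return residues + diffs             # the position itself is kept separately below

    # Group positions by these 2|Sigma| - 1 components; a hash map stands in
    # for the radix sort / deterministic grouping used in the paper.
    groups = defaultdict(list)
    for p in range(n + 1):
        groups[fingerprint()].append(p)     # O(|Sigma|) work per position
        if p < n:
            count[T[p]] += 1

    # T[i .. j] is a permuted scaled occurrence iff positions i and j + 1 share a
    # fingerprint, so the smallest k for start i comes from the next position in
    # i's group; the window length is then automatically a multiple of m.
    result = [None] * n
    for positions in groups.values():       # each list is already in increasing order
        for a, b in zip(positions, positions[1:]):
            result[a] = (b - a) // m
    return result

# Example: P = "ab", T = "aabb" -> [2, 1, None, None], since "aabb" is a
# permuted 2-scaled occurrence and "ab" (starting at position 1) a 1-scaled one.
print(smallest_scaled_occurrences("aabb", "ab"))
```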
pattern matching
scaled matching
jumbled matching