Approximate all-pairs suffix/prefix overlaps (Q418172): Difference between revisions
From MaRDI portal
Created a new Item |
ReferenceBot (talk | contribs) Changed an Item |
||
(7 intermediate revisions by 5 users not shown) | |||
Property / review text | |||
Motivated by problems in sequence assembly, the authors give space-efficient algorithms for finding all approximate prefix/suffix overlaps. Specifically, they show how, given a set of \(r\) strings with total length \(n\), a minimum length \(t\) and a maximum error-rate \(\epsilon\), we can use an FM-index to find every overlap with length \(\ell \geq t\) between a suffix of one string and a prefix of another, for which the edit distance between the suffix and prefix is at most \(\lceil \epsilon \ell \rceil\). This takes \(n H + o (n \log \sigma) + r \log_2 r\) bits of space, where \(H \leq \log_2 \sigma\) is the empirical entropy of the set of strings and \(\sigma\) is the size of the alphabet. The authors first describe how we can use a procedure they call backward backtracking to find every suffix/prefix overlap for which the Hamming distance between the suffix and prefix is at most a given bound \(k\). To do this, for each string \(T^i\) in the set we perform a recursively branching backward search with the FM-index for strings within Hamming distance \(k\) of a suffix of \(T^i\); we report an overlap whenever the current interval in the FM-index contains an end-of-string symbol, and we stop recursing whenever the current interval becomes empty. This search takes a total of \(O (\sigma^k t_{\mathsf{LF}} \sum_i |T^i|^{k + 1} + r*)\) time in the worst case, where \(t_{\mathsf{LF}}\) is the time for a backward-step in the FM-index and \(r*\) is the number of overlaps reported. The authors note that setting \(k = 0\) gives a space-efficient method for finding exact overlaps, and that their algorithm can be modified to find overlaps with edit distance at most \(k\). The authors' main result is their algorithm based on suffix filters, which were introduced by \textit{J. Kärkkäinen} and \textit{J. C. Na} [``Faster filters for approximate string matching'', in: Proceedings of the 9th workshop on algorithm engineering and experiments (ALENEX). Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). 84--90 (2007)] for approximate pattern matching. For each string \(T^i\) in the set, we divide \(T^i\) into factors of length \(p = \min_{t \leq \ell \leq |T^i|} \left\{ \left\lceil \frac{\ell}{\lceil \epsilon \ell \rceil + 1} \right\rceil \right\}\), except that the last factor can be shorter. We first use backward backtracking to check whether \(T^i\) is sufficiently close to any prefix of any string in the set. Then, for \(j\) from 2 to \(\lceil |T^i| / p \rceil\), we use backward backtracking to check whether there is any string in the set such that, for \(j \leq j' \leq \lceil |T^i| / p \rceil\), the concatenation of the \(j\)th through \(j'\)th factors of \(T^i\) is sufficiently close to a corresponding substring of that string. Finally, one has to validate any candidate overlaps this procedure returns. Overall, one needs only \(n H + o (n \log \sigma) + r \log_2 r\) bits of space, although the authors cannot guarantee any interesting worst-case time bounds. The authors' final contribution is a preliminary experimental comparison of their algorithms to each other and to an algorithm by \textit{K.~R. Rasmussen, J. Stoye} and \textit{E. W. Myers} [Lect. Notes Comput. Sci. 3500, 189--203 (2006; Zbl 1119.92329)], which finds all sufficiently close matches over a given length and not only suffix/prefix overlaps. The authors used access to a plain-text version of the strings to check candidate overlaps, increasing their space bound by about \(n \log_2 \sigma\) bits. They report that Rasmussen et al.'s algorithm can be made faster but at the cost of using four to five times more space. | |||
Property / review text: Motivated by problems in sequence assembly, the authors give space-efficient algorithms for finding all approximate prefix/suffix overlaps. Specifically, they show how, given a set of \(r\) strings with total length \(n\), a minimum length \(t\) and a maximum error-rate \(\epsilon\), we can use an FM-index to find every overlap with length \(\ell \geq t\) between a suffix of one string and a prefix of another, for which the edit distance between the suffix and prefix is at most \(\lceil \epsilon \ell \rceil\). This takes \(n H + o (n \log \sigma) + r \log_2 r\) bits of space, where \(H \leq \log_2 \sigma\) is the empirical entropy of the set of strings and \(\sigma\) is the size of the alphabet. The authors first describe how we can use a procedure they call backward backtracking to find every suffix/prefix overlap for which the Hamming distance between the suffix and prefix is at most a given bound \(k\). To do this, for each string \(T^i\) in the set we perform a recursively branching backward search with the FM-index for strings within Hamming distance \(k\) of a suffix of \(T^i\); we report an overlap whenever the current interval in the FM-index contains an end-of-string symbol, and we stop recursing whenever the current interval becomes empty. This search takes a total of \(O (\sigma^k t_{\mathsf{LF}} \sum_i |T^i|^{k + 1} + r*)\) time in the worst case, where \(t_{\mathsf{LF}}\) is the time for a backward-step in the FM-index and \(r*\) is the number of overlaps reported. The authors note that setting \(k = 0\) gives a space-efficient method for finding exact overlaps, and that their algorithm can be modified to find overlaps with edit distance at most \(k\). The authors' main result is their algorithm based on suffix filters, which were introduced by \textit{J. Kärkkäinen} and \textit{J. C. Na} [``Faster filters for approximate string matching'', in: Proceedings of the 9th workshop on algorithm engineering and experiments (ALENEX). Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). 84--90 (2007)] for approximate pattern matching. For each string \(T^i\) in the set, we divide \(T^i\) into factors of length \(p = \min_{t \leq \ell \leq |T^i|} \left\{ \left\lceil \frac{\ell}{\lceil \epsilon \ell \rceil + 1} \right\rceil \right\}\), except that the last factor can be shorter. We first use backward backtracking to check whether \(T^i\) is sufficiently close to any prefix of any string in the set. Then, for \(j\) from 2 to \(\lceil |T^i| / p \rceil\), we use backward backtracking to check whether there is any string in the set such that, for \(j \leq j' \leq \lceil |T^i| / p \rceil\), the concatenation of the \(j\)th through \(j'\)th factors of \(T^i\) is sufficiently close to a corresponding substring of that string. Finally, one has to validate any candidate overlaps this procedure returns. Overall, one needs only \(n H + o (n \log \sigma) + r \log_2 r\) bits of space, although the authors cannot guarantee any interesting worst-case time bounds. The authors' final contribution is a preliminary experimental comparison of their algorithms to each other and to an algorithm by \textit{K.~R. Rasmussen, J. Stoye} and \textit{E. W. Myers} [Lect. Notes Comput. Sci. 3500, 189--203 (2006; Zbl 1119.92329)], which finds all sufficiently close matches over a given length and not only suffix/prefix overlaps. The authors used access to a plain-text version of the strings to check candidate overlaps, increasing their space bound by about \(n \log_2 \sigma\) bits. They report that Rasmussen et al.'s algorithm can be made faster but at the cost of using four to five times more space. / rank | |||
Normal rank | |||
Property / reviewed by | |||
Property / reviewed by: Travis Gagie / rank | |||
Normal rank | |||
Property / Mathematics Subject Classification ID | |||
Property / Mathematics Subject Classification ID: 68W32 / rank | |||
Normal rank | |||
Property / Mathematics Subject Classification ID | |||
Property / Mathematics Subject Classification ID: 92D20 / rank | |||
Normal rank | |||
Property / zbMATH DE Number | |||
Property / zbMATH DE Number: 6038307 / rank | |||
Normal rank | |||
Property / zbMATH Keywords | |||
suffix/prefix matching | |||
Property / zbMATH Keywords: suffix/prefix matching / rank | |||
Normal rank | |||
Property / zbMATH Keywords | |||
approximate pattern matching | |||
Property / zbMATH Keywords: approximate pattern matching / rank | |||
Normal rank | |||
Property / describes a project that uses | |||
Property / describes a project that uses: Velvet / rank | |||
Normal rank | |||
Property / describes a project that uses | |||
Property / describes a project that uses: Soap / rank | |||
Normal rank | |||
Property / describes a project that uses | |||
Property / describes a project that uses: BWA / rank | |||
Normal rank | |||
Property / MaRDI profile type | |||
Property / MaRDI profile type: MaRDI publication profile / rank | |||
Normal rank | |||
Property / full work available at URL | |||
Property / full work available at URL: https://doi.org/10.1016/j.ic.2012.02.002 / rank | |||
Normal rank | |||
Property / OpenAlex ID | |||
Property / OpenAlex ID: W2062266313 / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Dictionary matching and indexing with errors and don't cares / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Indexing compressed text / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Compressed representations of sequences and full-text indexes / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Algorithms on Strings, Trees and Sequences / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Bit-parallel witnesses and their applications to approximate string matching / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Q4035246 / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Faster Filters for Approximate String Matching / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Combinatorial algorithms for DNA sequence assembly / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Q5528329 / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Unified View of Backward Backtracking in Short Read Mapping / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Dynamic Entropy-Compressed Sequences and Full-Text Indexes / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Suffix Arrays: A New Method for On-Line String Searches / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: A fast bit-vector algorithm for approximate string matching based on dynamic programming / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: An Eulerian path approach to DNA fragment assembly / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: The theory and computation of evolutionary distances: Pattern recognition / rank | |||
Normal rank | |||
Property / cites work | |||
Property / cites work: Approximate All-Pairs Suffix/Prefix Overlaps / rank | |||
Normal rank | |||
links / mardi / name | links / mardi / name | ||
Revision as of 06:13, 5 July 2024
scientific article
Language | Label | Description | Also known as |
---|---|---|---|
English | Approximate all-pairs suffix/prefix overlaps |
scientific article |
Statements
Approximate all-pairs suffix/prefix overlaps (English)
0 references
24 May 2012
0 references
Motivated by problems in sequence assembly, the authors give space-efficient algorithms for finding all approximate prefix/suffix overlaps. Specifically, they show how, given a set of \(r\) strings with total length \(n\), a minimum length \(t\) and a maximum error-rate \(\epsilon\), we can use an FM-index to find every overlap with length \(\ell \geq t\) between a suffix of one string and a prefix of another, for which the edit distance between the suffix and prefix is at most \(\lceil \epsilon \ell \rceil\). This takes \(n H + o (n \log \sigma) + r \log_2 r\) bits of space, where \(H \leq \log_2 \sigma\) is the empirical entropy of the set of strings and \(\sigma\) is the size of the alphabet. The authors first describe how we can use a procedure they call backward backtracking to find every suffix/prefix overlap for which the Hamming distance between the suffix and prefix is at most a given bound \(k\). To do this, for each string \(T^i\) in the set we perform a recursively branching backward search with the FM-index for strings within Hamming distance \(k\) of a suffix of \(T^i\); we report an overlap whenever the current interval in the FM-index contains an end-of-string symbol, and we stop recursing whenever the current interval becomes empty. This search takes a total of \(O (\sigma^k t_{\mathsf{LF}} \sum_i |T^i|^{k + 1} + r*)\) time in the worst case, where \(t_{\mathsf{LF}}\) is the time for a backward-step in the FM-index and \(r*\) is the number of overlaps reported. The authors note that setting \(k = 0\) gives a space-efficient method for finding exact overlaps, and that their algorithm can be modified to find overlaps with edit distance at most \(k\). The authors' main result is their algorithm based on suffix filters, which were introduced by \textit{J. Kärkkäinen} and \textit{J. C. Na} [``Faster filters for approximate string matching'', in: Proceedings of the 9th workshop on algorithm engineering and experiments (ALENEX). Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). 84--90 (2007)] for approximate pattern matching. For each string \(T^i\) in the set, we divide \(T^i\) into factors of length \(p = \min_{t \leq \ell \leq |T^i|} \left\{ \left\lceil \frac{\ell}{\lceil \epsilon \ell \rceil + 1} \right\rceil \right\}\), except that the last factor can be shorter. We first use backward backtracking to check whether \(T^i\) is sufficiently close to any prefix of any string in the set. Then, for \(j\) from 2 to \(\lceil |T^i| / p \rceil\), we use backward backtracking to check whether there is any string in the set such that, for \(j \leq j' \leq \lceil |T^i| / p \rceil\), the concatenation of the \(j\)th through \(j'\)th factors of \(T^i\) is sufficiently close to a corresponding substring of that string. Finally, one has to validate any candidate overlaps this procedure returns. Overall, one needs only \(n H + o (n \log \sigma) + r \log_2 r\) bits of space, although the authors cannot guarantee any interesting worst-case time bounds. The authors' final contribution is a preliminary experimental comparison of their algorithms to each other and to an algorithm by \textit{K.~R. Rasmussen, J. Stoye} and \textit{E. W. Myers} [Lect. Notes Comput. Sci. 3500, 189--203 (2006; Zbl 1119.92329)], which finds all sufficiently close matches over a given length and not only suffix/prefix overlaps. The authors used access to a plain-text version of the strings to check candidate overlaps, increasing their space bound by about \(n \log_2 \sigma\) bits. They report that Rasmussen et al.'s algorithm can be made faster but at the cost of using four to five times more space.
0 references
suffix/prefix matching
0 references
approximate pattern matching
0 references
0 references
0 references