Approximate all-pairs suffix/prefix overlaps (Q418172): Difference between revisions

Motivated by problems in sequence assembly, the authors give space-efficient algorithms for finding all approximate suffix/prefix overlaps. Specifically, they show how, given a set of \(r\) strings with total length \(n\), a minimum length \(t\) and a maximum error rate \(\epsilon\), we can use an FM-index to find every overlap of length \(\ell \geq t\) between a suffix of one string and a prefix of another for which the edit distance between the suffix and prefix is at most \(\lceil \epsilon \ell \rceil\). This takes \(n H + o (n \log \sigma) + r \log_2 r\) bits of space, where \(H \leq \log_2 \sigma\) is the empirical entropy of the set of strings and \(\sigma\) is the size of the alphabet.

The authors first describe how we can use a procedure they call backward backtracking to find every suffix/prefix overlap for which the Hamming distance between the suffix and prefix is at most a given bound \(k\). To do this, for each string \(T^i\) in the set we perform a recursively branching backward search with the FM-index for strings within Hamming distance \(k\) of a suffix of \(T^i\); we report an overlap whenever the current interval in the FM-index contains an end-of-string symbol, and we stop recursing whenever the current interval becomes empty. This search takes a total of \(O (\sigma^k t_{\mathsf{LF}} \sum_i |T^i|^{k + 1} + r^*)\) time in the worst case, where \(t_{\mathsf{LF}}\) is the time for a backward step in the FM-index and \(r^*\) is the number of overlaps reported. The authors note that setting \(k = 0\) gives a space-efficient method for finding exact overlaps, and that their algorithm can be modified to find overlaps with edit distance at most \(k\).

The authors' main result is their algorithm based on suffix filters, which were introduced by \textit{J. Kärkkäinen} and \textit{J. C. Na} [``Faster filters for approximate string matching'', in: Proceedings of the 9th workshop on algorithm engineering and experiments (ALENEX). Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). 84--90 (2007)] for approximate pattern matching. For each string \(T^i\) in the set, we divide \(T^i\) into factors of length \(p = \min_{t \leq \ell \leq |T^i|} \left\{ \left\lceil \frac{\ell}{\lceil \epsilon \ell \rceil + 1} \right\rceil \right\}\), except that the last factor can be shorter. We first use backward backtracking to check whether \(T^i\) is sufficiently close to any prefix of any string in the set. Then, for \(j\) from 2 to \(\lceil |T^i| / p \rceil\), we use backward backtracking to check whether there is any string in the set such that, for \(j \leq j' \leq \lceil |T^i| / p \rceil\), the concatenation of the \(j\)th through \(j'\)th factors of \(T^i\) is sufficiently close to a corresponding substring of that string. Finally, any candidate overlaps this procedure returns have to be validated. Overall, this needs only \(n H + o (n \log \sigma) + r \log_2 r\) bits of space, although the authors cannot guarantee any interesting worst-case time bounds.

The authors' final contribution is a preliminary experimental comparison of their algorithms to each other and to an algorithm by \textit{K. R. Rasmussen, J. Stoye} and \textit{E. W. Myers} [Lect. Notes Comput. Sci. 3500, 189--203 (2006; Zbl 1119.92329)], which finds all sufficiently close matches over a given length and not only suffix/prefix overlaps. In the experiments the authors used a plain-text copy of the strings to check candidate overlaps, which increases their space bound by about \(n \log_2 \sigma\) bits. They report that Rasmussen et al.'s algorithm can be made faster but at the cost of using four to five times more space.
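The backward backtracking step for Hamming distance can be illustrated with a small sketch. This is not the authors' implementation: it assumes a toy, uncompressed FM-index built by naive suffix sorting (so it does not achieve the stated space bound), and the names `build_index`, `step` and `hamming_overlaps` are hypothetical. It also assumes that the input strings contain neither `$` nor the NUL character, which serve here as the end-of-string symbol and the final sentinel.

```python
def build_index(strings):
    """Toy FM-index over '$T1$T2...$Tr' followed by a unique NUL sentinel."""
    text = "".join("$" + s for s in strings) + "\x00"
    # Naive suffix sorting; a real index would be built and stored compactly.
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    bwt = [text[p - 1] for p in sa]  # p == 0 wraps around to the sentinel
    alphabet = sorted(set(text))
    # C[c] = number of characters in the text that are smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][r] = number of occurrences of c in bwt[:r].
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    # Text position of each '$' -> id of the string that follows it.
    start_id, pos = {}, 0
    for j, s in enumerate(strings):
        start_id[pos] = j
        pos += len(s) + 1
    return {"sa": sa, "C": C, "occ": occ, "start_id": start_id, "n": len(text)}


def step(index, c, lo, hi):
    """One backward step: extend the half-open interval [lo, hi) by c."""
    base, occ = index["C"][c], index["occ"][c]
    return base + occ[lo], base + occ[hi]


def hamming_overlaps(strings, t, k):
    """Return {(i, j, length)}: a suffix of strings[i] of the given length is
    within Hamming distance k of the prefix of strings[j] of that length."""
    index = build_index(strings)
    letters = sorted(set("".join(strings)))
    found = set()

    def recurse(i, pos, lo, hi, errors):
        if pos < 0:
            return
        s = strings[i]
        for c in letters:  # branch over every letter, not only s[pos]
            e = errors + (c != s[pos])
            if e > k:
                continue
            nlo, nhi = step(index, c, lo, hi)
            if nlo >= nhi:
                continue  # empty interval: stop recursing on this branch
            length = len(s) - pos
            if length >= t:
                # Rows preceded by '$' are occurrences at the start of a string.
                dlo, dhi = step(index, "$", nlo, nhi)
                for row in range(dlo, dhi):
                    j = index["start_id"][index["sa"][row]]
                    if j != i:
                        found.add((i, j, length))
            recurse(i, pos - 1, nlo, nhi, e)

    for i in range(len(strings)):
        recurse(i, len(strings[i]) - 1, 0, index["n"], 0)
    return found
```

For example, `hamming_overlaps(["ACGTAC", "GTACGG", "TACCA"], 3, 0)` searches for exact overlaps of length at least 3, which corresponds to the \(k = 0\) case mentioned in the review. Locating the matching strings through the suffix array is a simplification made for brevity.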

reviewed by

Travis Gagie

zbMATH Keywords

suffix/prefix matching

approximate pattern matching

describes a project that uses

Velvet

Soap

BWA

MaRDI profile type

MaRDI publication profile

full work available at URL

https://doi.org/10.1016/j.ic.2012.02.002

cites work

Dictionary matching and indexing with errors and don't cares

Indexing compressed text

Compressed representations of sequences and full-text indexes

Algorithms on Strings, Trees and Sequences

Bit-parallel witnesses and their applications to approximate string matching

Q4035246

Faster Filters for Approximate String Matching

Combinatorial algorithms for DNA sequence assembly

Q5528329

Unified View of Backward Backtracking in Short Read Mapping

Dynamic Entropy-Compressed Sequences and Full-Text Indexes

Suffix Arrays: A New Method for On-Line String Searches

A fast bit-vector algorithm for approximate string matching based on dynamic programming

Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem

An Eulerian path approach to DNA fragment assembly

The theory and computation of evolutionary distances: Pattern recognition

Approximate All-Pairs Suffix/Prefix Overlaps

Identifiers

zbMATH Open document ID

1254.68361

DOI

10.1016/j.ic.2012.02.002

Mathematics Subject Classification ID

68W32

92D20

Sitelinks

Mathematics (1 entry)

mardi Publication:418172

@@ Property / reviewed by @@
+Travis Gagie
@@ Property / reviewed by: Travis Gagie / rank @@
+Normal rank
@@ Property / Mathematics Subject Classification ID @@
+68W32
@@ Property / Mathematics Subject Classification ID: 68W32 / rank @@
+Normal rank
@@ Property / Mathematics Subject Classification ID @@
+92D20
@@ Property / Mathematics Subject Classification ID: 92D20 / rank @@
+Normal rank
@@ Property / zbMATH DE Number @@
+6038307
@@ Property / zbMATH DE Number: 6038307 / rank @@
+Normal rank
@@ Property / zbMATH Keywords @@
+suffix/prefix matching
@@ Property / zbMATH Keywords: suffix/prefix matching / rank @@
+Normal rank
@@ Property / zbMATH Keywords @@
+approximate pattern matching
@@ Property / zbMATH Keywords: approximate pattern matching / rank @@
+Normal rank
@@ Property / describes a project that uses @@
+Velvet
@@ Property / describes a project that uses: Velvet / rank @@
+Normal rank
@@ Property / describes a project that uses @@
+Soap
@@ Property / describes a project that uses: Soap / rank @@
+Normal rank
@@ Property / describes a project that uses @@
+BWA
@@ Property / describes a project that uses: BWA / rank @@
+Normal rank
@@ Property / MaRDI profile type @@
+MaRDI publication profile
@@ Property / MaRDI profile type: MaRDI publication profile / rank @@
+Normal rank
@@ Property / full work available at URL @@
+https://doi.org/10.1016/j.ic.2012.02.002
+Normal rank
@@ Property / OpenAlex ID @@
+W2062266313
@@ Property / OpenAlex ID: W2062266313 / rank @@
+Normal rank
@@ Property / cites work @@
+Dictionary matching and indexing with errors and don't cares
+Normal rank
@@ Property / cites work @@
+Indexing compressed text
@@ Property / cites work: Indexing compressed text / rank @@
+Normal rank
@@ Property / cites work @@
+Compressed representations of sequences and full-text indexes
+Normal rank
@@ Property / cites work @@
+Algorithms on Strings, Trees and Sequences
@@ Property / cites work: Algorithms on Strings, Trees and Sequences / rank @@
+Normal rank
@@ Property / cites work @@
+Bit-parallel witnesses and their applications to approximate string matching
+Normal rank
@@ Property / cites work @@
+Q4035246
@@ Property / cites work: Q4035246 / rank @@
+Normal rank
@@ Property / cites work @@
+Faster Filters for Approximate String Matching
@@ Property / cites work: Faster Filters for Approximate String Matching / rank @@
+Normal rank
@@ Property / cites work @@
+Combinatorial algorithms for DNA sequence assembly
+Normal rank
@@ Property / cites work @@
+Q5528329
@@ Property / cites work: Q5528329 / rank @@
+Normal rank
@@ Property / cites work @@
+Unified View of Backward Backtracking in Short Read Mapping
+Normal rank
@@ Property / cites work @@
+Dynamic Entropy-Compressed Sequences and Full-Text Indexes
+Normal rank
@@ Property / cites work @@
+Suffix Arrays: A New Method for On-Line String Searches
+Normal rank
@@ Property / cites work @@
+A fast bit-vector algorithm for approximate string matching based on dynamic programming
+Normal rank
@@ Property / cites work @@
+Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem
+Normal rank
@@ Property / cites work @@
+An Eulerian path approach to DNA fragment assembly
+Normal rank
@@ Property / cites work @@
+The theory and computation of evolutionary distances: Pattern recognition
+Normal rank
@@ Property / cites work @@
+Approximate All-Pairs Suffix/Prefix Overlaps
@@ Property / cites work: Approximate All-Pairs Suffix/Prefix Overlaps / rank @@
+Normal rank
@@ links / mardi / name / links / mardi / name @@
+Publication:418172