Approximate all-pairs suffix/prefix overlaps (Q418172)

Motivated by problems in sequence assembly, the authors give space-efficient algorithms for finding all approximate prefix/suffix overlaps. Specifically, they show how, given a set of \(r\) strings with total length \(n\), a minimum length \(t\) and a maximum error-rate \(\epsilon\), we can use an FM-index to find every overlap with length \(\ell \geq t\) between a suffix of one string and a prefix of another, for which the edit distance between the suffix and prefix is at most \(\lceil \epsilon \ell \rceil\). This takes \(n H + o (n \log \sigma) + r \log_2 r\) bits of space, where \(H \leq \log_2 \sigma\) is the empirical entropy of the set of strings and \(\sigma\) is the size of the alphabet. The authors first describe how we can use a procedure they call backward backtracking to find every suffix/prefix overlap for which the Hamming distance between the suffix and prefix is at most a given bound \(k\). To do this, for each string \(T^i\) in the set we perform a recursively branching backward search with the FM-index for strings within Hamming distance \(k\) of a suffix of \(T^i\); we report an overlap whenever the current interval in the FM-index contains an end-of-string symbol, and we stop recursing whenever the current interval becomes empty. This search takes a total of \(O (\sigma^k t_{\mathsf{LF}} \sum_i |T^i|^{k + 1} + r*)\) time in the worst case, where \(t_{\mathsf{LF}}\) is the time for a backward-step in the FM-index and \(r*\) is the number of overlaps reported. The authors note that setting \(k = 0\) gives a space-efficient method for finding exact overlaps, and that their algorithm can be modified to find overlaps with edit distance at most \(k\). The authors' main result is their algorithm based on suffix filters, which were introduced by \textit{J. Kärkkäinen} and \textit{J. C. Na} [``Faster filters for approximate string matching'', in: Proceedings of the 9th workshop on algorithm engineering and experiments (ALENEX). Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). 84--90 (2007)] for approximate pattern matching. For each string \(T^i\) in the set, we divide \(T^i\) into factors of length \(p = \min_{t \leq \ell \leq |T^i|} \left\{ \left\lceil \frac{\ell}{\lceil \epsilon \ell \rceil + 1} \right\rceil \right\}\), except that the last factor can be shorter. We first use backward backtracking to check whether \(T^i\) is sufficiently close to any prefix of any string in the set. Then, for \(j\) from 2 to \(\lceil |T^i| / p \rceil\), we use backward backtracking to check whether there is any string in the set such that, for \(j \leq j' \leq \lceil |T^i| / p \rceil\), the concatenation of the \(j\)th through \(j'\)th factors of \(T^i\) is sufficiently close to a corresponding substring of that string. Finally, one has to validate any candidate overlaps this procedure returns. Overall, one needs only \(n H + o (n \log \sigma) + r \log_2 r\) bits of space, although the authors cannot guarantee any interesting worst-case time bounds. The authors' final contribution is a preliminary experimental comparison of their algorithms to each other and to an algorithm by \textit{K.~R. Rasmussen, J. Stoye} and \textit{E. W. Myers} [Lect. Notes Comput. Sci. 3500, 189--203 (2006; Zbl 1119.92329)], which finds all sufficiently close matches over a given length and not only suffix/prefix overlaps. The authors used access to a plain-text version of the strings to check candidate overlaps, increasing their space bound by about \(n \log_2 \sigma\) bits. They report that Rasmussen et al.'s algorithm can be made faster but at the cost of using four to five times more space.

0 references

reviewed by

Travis Gagie

0 references

zbMATH Keywords

suffix/prefix matching

0 references

approximate pattern matching

0 references

describes a project that uses

Velvet

0 references