Approximate all-pairs suffix/prefix overlaps (Q418172): Difference between revisions

From MaRDI portal
Added link to MaRDI item.
ReferenceBot (talk | contribs)
Changed an Item
(5 intermediate revisions by 3 users not shown)
Property / describes a project that uses
 
Property / describes a project that uses: Velvet / rank
 
Normal rank
Property / describes a project that uses
 
Property / describes a project that uses: Soap / rank
 
Normal rank
Property / describes a project that uses
 
Property / describes a project that uses: BWA / rank
 
Normal rank
Property / MaRDI profile type
 
Property / MaRDI profile type: MaRDI publication profile / rank
 
Normal rank
Property / full work available at URL
 
Property / full work available at URL: https://doi.org/10.1016/j.ic.2012.02.002 / rank
 
Normal rank
Property / OpenAlex ID
 
Property / OpenAlex ID: W2062266313 / rank
 
Normal rank
Property / cites work
 
Property / cites work: Dictionary matching and indexing with errors and don't cares / rank
 
Normal rank
Property / cites work
 
Property / cites work: Indexing compressed text / rank
 
Normal rank
Property / cites work
 
Property / cites work: Compressed representations of sequences and full-text indexes / rank
 
Normal rank
Property / cites work
 
Property / cites work: Algorithms on Strings, Trees and Sequences / rank
 
Normal rank
Property / cites work
 
Property / cites work: Bit-parallel witnesses and their applications to approximate string matching / rank
 
Normal rank
Property / cites work
 
Property / cites work: Q4035246 / rank
 
Normal rank
Property / cites work
 
Property / cites work: Faster Filters for Approximate String Matching / rank
 
Normal rank
Property / cites work
 
Property / cites work: Combinatorial algorithms for DNA sequence assembly / rank
 
Normal rank
Property / cites work
 
Property / cites work: Q5528329 / rank
 
Normal rank
Property / cites work
 
Property / cites work: Unified View of Backward Backtracking in Short Read Mapping / rank
 
Normal rank
Property / cites work
 
Property / cites work: Dynamic Entropy-Compressed Sequences and Full-Text Indexes / rank
 
Normal rank
Property / cites work
 
Property / cites work: Suffix Arrays: A New Method for On-Line String Searches / rank
 
Normal rank
Property / cites work
 
Property / cites work: A fast bit-vector algorithm for approximate string matching based on dynamic programming / rank
 
Normal rank
Property / cites work
 
Property / cites work: Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem / rank
 
Normal rank
Property / cites work
 
Property / cites work: An Eulerian path approach to DNA fragment assembly / rank
 
Normal rank
Property / cites work
 
Property / cites work: The theory and computation of evolutionary distances: Pattern recognition / rank
 
Normal rank
Property / cites work
 
Property / cites work: Approximate All-Pairs Suffix/Prefix Overlaps / rank
 
Normal rank

Revision as of 06:13, 5 July 2024

scientific article
Language Label Description Also known as
English
Approximate all-pairs suffix/prefix overlaps
scientific article

    Statements

    Approximate all-pairs suffix/prefix overlaps (English)
    0 references
    0 references
    0 references
    0 references
    24 May 2012
    0 references
    Motivated by problems in sequence assembly, the authors give space-efficient algorithms for finding all approximate prefix/suffix overlaps. Specifically, they show how, given a set of \(r\) strings with total length \(n\), a minimum length \(t\) and a maximum error-rate \(\epsilon\), we can use an FM-index to find every overlap with length \(\ell \geq t\) between a suffix of one string and a prefix of another, for which the edit distance between the suffix and prefix is at most \(\lceil \epsilon \ell \rceil\). This takes \(n H + o (n \log \sigma) + r \log_2 r\) bits of space, where \(H \leq \log_2 \sigma\) is the empirical entropy of the set of strings and \(\sigma\) is the size of the alphabet. The authors first describe how we can use a procedure they call backward backtracking to find every suffix/prefix overlap for which the Hamming distance between the suffix and prefix is at most a given bound \(k\). To do this, for each string \(T^i\) in the set we perform a recursively branching backward search with the FM-index for strings within Hamming distance \(k\) of a suffix of \(T^i\); we report an overlap whenever the current interval in the FM-index contains an end-of-string symbol, and we stop recursing whenever the current interval becomes empty. This search takes a total of \(O (\sigma^k t_{\mathsf{LF}} \sum_i |T^i|^{k + 1} + r*)\) time in the worst case, where \(t_{\mathsf{LF}}\) is the time for a backward-step in the FM-index and \(r*\) is the number of overlaps reported. The authors note that setting \(k = 0\) gives a space-efficient method for finding exact overlaps, and that their algorithm can be modified to find overlaps with edit distance at most \(k\). The authors' main result is their algorithm based on suffix filters, which were introduced by \textit{J. Kärkkäinen} and \textit{J. C. Na} [``Faster filters for approximate string matching'', in: Proceedings of the 9th workshop on algorithm engineering and experiments (ALENEX). Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). 84--90 (2007)] for approximate pattern matching. For each string \(T^i\) in the set, we divide \(T^i\) into factors of length \(p = \min_{t \leq \ell \leq |T^i|} \left\{ \left\lceil \frac{\ell}{\lceil \epsilon \ell \rceil + 1} \right\rceil \right\}\), except that the last factor can be shorter. We first use backward backtracking to check whether \(T^i\) is sufficiently close to any prefix of any string in the set. Then, for \(j\) from 2 to \(\lceil |T^i| / p \rceil\), we use backward backtracking to check whether there is any string in the set such that, for \(j \leq j' \leq \lceil |T^i| / p \rceil\), the concatenation of the \(j\)th through \(j'\)th factors of \(T^i\) is sufficiently close to a corresponding substring of that string. Finally, one has to validate any candidate overlaps this procedure returns. Overall, one needs only \(n H + o (n \log \sigma) + r \log_2 r\) bits of space, although the authors cannot guarantee any interesting worst-case time bounds. The authors' final contribution is a preliminary experimental comparison of their algorithms to each other and to an algorithm by \textit{K.~R. Rasmussen, J. Stoye} and \textit{E. W. Myers} [Lect. Notes Comput. Sci. 3500, 189--203 (2006; Zbl 1119.92329)], which finds all sufficiently close matches over a given length and not only suffix/prefix overlaps. The authors used access to a plain-text version of the strings to check candidate overlaps, increasing their space bound by about \(n \log_2 \sigma\) bits. They report that Rasmussen et al.'s algorithm can be made faster but at the cost of using four to five times more space.
    0 references
    0 references
    suffix/prefix matching
    0 references
    approximate pattern matching
    0 references
    0 references
    0 references
    0 references

    Identifiers