Alignment-free sequence comparison using absent words
From MaRDI portal
Abstract: Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realised by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which are based on measures referring to the composition of sequences in terms of their constituent patterns. These measures, such as -gram distance, are usually computed in time linear with respect to the length of the sequences. In this paper, we focus on the complementary idea: how two sequences can be efficiently compared based on information that does not occur in the sequences. A word is an {em absent word} of some sequence if it does not occur in the sequence. An absent word is {em minimal} if all its proper factors occur in the sequence. Here we present the first linear-time and linear-space algorithm to compare two sequences by considering {em all} their minimal absent words. In the process, we present results of combinatorial interest, and also extend the proposed techniques to compare circular sequences. We also present an algorithm that, given a word of length , computes the largest integer for which all factors of of that length occur in some minimal absent word of in time and space . Finally, we show that the known asymptotic upper bound on the number of minimal absent words of a word is tight.
Recommendations
- Linear-Time Sequence Comparison Using Minimal Absent Words & Applications
- Efficient computation of shortest absent words in complete genomes
- Using minimal absent words to build phylogeny
- Building phylogeny with minimal absent words
- Efficient computation of shortest absent words in a genomic sequence
Cites work
- A linear-time algorithm for a special case of disjoint set union
- Accurate and efficient methods to improve multiple circular sequence alignment
- Algorithms on Strings
- An Eulerian path approach to DNA fragment assembly
- Automata and forbidden words
- Comparison of phylogenetic trees
- Forbidden words in symbolic dynamics
- Indexing weighted sequences: neat and efficient
- Inducing the LCP-array
- Linear-Time Sequence Comparison Using Minimal Absent Words & Applications
- Minimal absent words in a sliding window and applications to on-line pattern matching
- Minimum unique substrings and maximum repeats
- On a cyclic string-to-string correction problem
- Space-efficient preprocessing schemes for range minimum queries on static arrays
- Storing a Sparse Table with 0 (1) Worst Case Access Time
- Suffix Arrays: A New Method for On-Line String Searches
- The longest common extension problem revisited and applications to approximate string searching
- Using minimal absent words to build phylogeny
- Versatile succinct representations of the bidirectional Burrows-Wheeler transform
- Word assembly through minimal forbidden words
- Words and forbidden factors
Cited in
(28)- Building phylogeny with minimal absent words
- Absent subsequences in words
- Efficient computation of shortest absent words in complete genomes
- Combinatorics of minimal absent words for a sliding window
- Density of \(k\)-ary words with 0, 1, 2-error overlaps
- Fast detection of specific fragments against a set of sequences
- Isometric words based on swap and mismatch distance
- Linear-time computation of DAWGs, symmetric indexing structures, and MAWs for integer alphabets
- Alignment free comparison: similarity distribution between the DNA primary sequences based on the shortest absent word
- Linear-time computation of generalized minimal absent words for multiple strings
- Isometric words and edit distance: main notions and new variations
- Using minimal absent words to build phylogeny
- Circular sequence comparison with \(q\)-grams
- Alignment-free sequence comparison with spaced k-mers.
- On overabundant words and their application to biological sequence analysis
- Linear-Time Sequence Comparison Using Minimal Absent Words & Applications
- Minimal absent words in a sliding window and applications to on-line pattern matching
- Minimal absent words in rooted and unrooted trees
- An estimator for local analysis of genome based on the minimal absent word
- Comparison of Symbol Sequences: No Editing, No Alignment
- Absent words in a sliding window with applications
- Efficient computation of shortest absent words in a genomic sequence
- Internal shortest absent word queries in constant time and linear space
- Some Investigations on Similarity Measures Based on Absent Words
- Extraction of high quality \(k\)-words for alignment-free sequence comparison
- Reverse-safe text indexing
- Constructing antidictionaries of long texts in output-sensitive space
- Absent Subsequences in Words
This page was built for publication: Alignment-free sequence comparison using absent words
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q1784946)