Alignment-free sequence comparison using absent words
From MaRDI portal
Publication:1784946
DOI10.1016/J.IC.2018.06.002zbMATH Open1400.68264arXiv1806.02718OpenAlexW2805439715WikidataQ129623427 ScholiaQ129623427MaRDI QIDQ1784946FDOQ1784946
Authors: Panagiotis Charalampopoulos, Maxime Crochemore, Gabriele Fici, Robert Mercaş, Solon P. Pissis
Publication date: 27 September 2018
Published in: Information and Computation (Search for Journal in Brave)
Abstract: Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realised by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which are based on measures referring to the composition of sequences in terms of their constituent patterns. These measures, such as -gram distance, are usually computed in time linear with respect to the length of the sequences. In this paper, we focus on the complementary idea: how two sequences can be efficiently compared based on information that does not occur in the sequences. A word is an {em absent word} of some sequence if it does not occur in the sequence. An absent word is {em minimal} if all its proper factors occur in the sequence. Here we present the first linear-time and linear-space algorithm to compare two sequences by considering {em all} their minimal absent words. In the process, we present results of combinatorial interest, and also extend the proposed techniques to compare circular sequences. We also present an algorithm that, given a word of length , computes the largest integer for which all factors of of that length occur in some minimal absent word of in time and space . Finally, we show that the known asymptotic upper bound on the number of minimal absent words of a word is tight.
Full work available at URL: https://arxiv.org/abs/1806.02718
Recommendations
- Linear-Time Sequence Comparison Using Minimal Absent Words & Applications
- Efficient computation of shortest absent words in complete genomes
- Using minimal absent words to build phylogeny
- Building phylogeny with minimal absent words
- Efficient computation of shortest absent words in a genomic sequence
Cites Work
- Comparison of phylogenetic trees
- Storing a Sparse Table with 0 (1) Worst Case Access Time
- A linear-time algorithm for a special case of disjoint set union
- Space-efficient preprocessing schemes for range minimum queries on static arrays
- Suffix Arrays: A New Method for On-Line String Searches
- On a cyclic string-to-string correction problem
- Words and forbidden factors
- Versatile succinct representations of the bidirectional Burrows-Wheeler transform
- Automata and forbidden words
- Algorithms on Strings
- The longest common extension problem revisited and applications to approximate string searching
- Forbidden words in symbolic dynamics
- Inducing the LCP-array
- An Eulerian path approach to DNA fragment assembly
- Using minimal absent words to build phylogeny
- Word assembly through minimal forbidden words
- Minimum unique substrings and maximum repeats
- Accurate and efficient methods to improve multiple circular sequence alignment
- Linear-Time Sequence Comparison Using Minimal Absent Words & Applications
- Minimal absent words in a sliding window and applications to on-line pattern matching
- Indexing weighted sequences: neat and efficient
Cited In (28)
- Alignment-free sequence comparison with spaced k-mers.
- Linear-Time Sequence Comparison Using Minimal Absent Words & Applications
- On overabundant words and their application to biological sequence analysis
- Building phylogeny with minimal absent words
- Alignment free comparison: similarity distribution between the DNA primary sequences based on the shortest absent word
- Absent words in a sliding window with applications
- Some Investigations on Similarity Measures Based on Absent Words
- Absent subsequences in words
- Circular sequence comparison with \(q\)-grams
- Using minimal absent words to build phylogeny
- An estimator for local analysis of genome based on the minimal absent word
- Extraction of high quality \(k\)-words for alignment-free sequence comparison
- Efficient computation of shortest absent words in a genomic sequence
- Fast detection of specific fragments against a set of sequences
- Isometric words based on swap and mismatch distance
- Reverse-safe text indexing
- Absent Subsequences in Words
- Combinatorics of minimal absent words for a sliding window
- Constructing antidictionaries of long texts in output-sensitive space
- Comparison of Symbol Sequences: No Editing, No Alignment
- Internal shortest absent word queries in constant time and linear space
- Linear-time computation of DAWGs, symmetric indexing structures, and MAWs for integer alphabets
- Minimal absent words in a sliding window and applications to on-line pattern matching
- Density of \(k\)-ary words with 0, 1, 2-error overlaps
- Minimal absent words in rooted and unrooted trees
- Efficient computation of shortest absent words in complete genomes
- Linear-time computation of generalized minimal absent words for multiple strings
- Isometric words and edit distance: main notions and new variations
Uses Software
This page was built for publication: Alignment-free sequence comparison using absent words
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q1784946)