On overabundant words and their application to biological sequence analysis (Q2326388)

scientific article; zbMATH DE number 7114316

Language	Label	Description	Also known as
default for all languages	No label defined
English	On overabundant words and their application to biological sequence analysis	scientific article; zbMATH DE number 7114316

Statements

instance of

scholarly article

0 references

title

On overabundant words and their application to biological sequence analysis (English)

0 references

author

Panagiotis Charalampopoulos

0 references

0 references

0 references

0 references

0 references

Dimitris Polychronopoulos

0 references

Yannis Almirantis

0 references

published in

Theoretical Computer Science

0 references

publication date

7 October 2019

0 references

full work available at URL

https://kclpure.kcl.ac.uk/portal/en/publications/on-overabundant-words-and-their-application-to-biological-sequence-analysis(ec2b4dfd-49cb-4be3-bc3d-7cc48325339e).html

0 references

review text

The authors use the biologically justified model introduced by \textit{V. Brendel} et al. [``Linguistics of nucleotide sequences: morphology and comparison of vocabularies'', J. Biomol. Struct. Dyn. 4, No. 1, 11--21 (1986; \url{doi:10.1080/07391102.1986.10507643})]. They prove non-trivial combinatorial properties and establish that it admits efficient computation for overabundant words as well. The results in this article extend their former study by presenting an \(O(n)\)-time and \(O(n)\)-space algorithm for computing all overabundant words in a sequence \(x\) of length \(n\) over an integer alphabet. They also present experimental results, using both synthetic and real data, which further highlight the effectiveness of this model. The computational problem can be described as follows. Given a sequence \(x\) of length \(n\) and a real number \(\rho >0\), compute the set of \(\rho\)-overabundant words, i.e., all words w for which \(dev( w)\ge \rho\), by presenting an \(O(n)\)-time and \(O(n)\)-space algorithm for computing all \(\rho\)-overabundant words (of any length) in a sequence \(x\) of length \(n\) over an integer alphabet. This result is based on a combinatorial property of the suffix tree \(T\) of \(x\) that are proved here: the number of distinct factors of \(x\) whose longest infix is the label of an explicit node of \(T\) is no more than \(3n-4\). Further is demonstrated that the presented algorithm is time-optimal by proving that \(O(n)\) is a tight upper bound for the number of \(\rho\)-overabundant words.

0 references

reviewed by

Andrey Zahariev

0 references

zbMATH Keywords

overabundant words

0 references

avoided words

0 references

pattern matching

0 references

suffix tree