Artificial sequences and complexity measures
From MaRDI portal
Publication: 4968837
MSC classes:
- 68W40 Analysis of algorithms
- 94A15 Information theory (general)
- 94A17 Measures of information, entropy
- 68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
- 68Q30 Algorithmic information theory (Kolmogorov complexity, etc.)
- 82C99 Time-dependent statistical mechanics (dynamic and nonequilibrium)
Abstract: In this paper we exploit concepts from information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. In particular, we introduce a class of methods that make crucial use of data compression techniques to define a measure of remoteness and distance between pairs of character sequences (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques can be used to introduce the notions of the dictionary of a given sequence and of an Artificial Text, and we show how these new tools can be used for information extraction. We point out the versatility and generality of our method, which applies to any kind of corpus of character strings independently of the type of coding behind it. As a case study we consider linguistically motivated problems and present results for automatic language recognition, authorship attribution and self-consistent classification.
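The compression-based distance the abstract describes can be illustrated with the closely related normalized compression distance of Cilibrasi and Vitányi (see "The Similarity Metric" and "Clustering by Compression" in the works cited below). This is a minimal sketch, not the authors' own relative-entropy measure, using Python's `zlib` as a stand-in for any real compressor; the toy "languages" are invented for illustration:

```python
import zlib

def c(s: bytes) -> int:
    """Compressed length of s (zlib at maximum compression level,
    standing in for any LZ-family compressor)."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two sequences.
    Values near 0: the sequences share most of their information;
    values near 1: they are informationally unrelated."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy language-recognition example (hypothetical data): an unknown
# text should lie closer, in compression distance, to a reference
# text sharing its vocabulary than to unrelated gibberish.
english = b"the quick brown fox jumps over the lazy dog " * 20
mockese = b"zxqv wvzq qzxv vqzw xqvz wzqx " * 20
unknown = b"the lazy dog sleeps while the quick fox runs " * 20

assert ncd(unknown, english) < ncd(unknown, mockese)
```

Concatenating two related texts compresses better than the sum of their parts, because the compressor reuses substrings of one to encode the other; that saving is what the distance measures.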
Cites work
- scientific article, zbMATH DE number 3427210 (no title available)
- scientific article, zbMATH DE number 3140805 (no title available)
- scientific article, zbMATH DE number 3143967 (no title available)
- scientific article, zbMATH DE number 3143969 (no title available)
- scientific article, zbMATH DE number 107482 (no title available)
- scientific article, zbMATH DE number 176034 (no title available)
- scientific article, zbMATH DE number 1010621 (no title available)
- scientific article, zbMATH DE number 910859 (no title available)
- A Mathematical Theory of Communication
- A formal theory of inductive inference. Part I
- A formal theory of inductive inference. Part II
- A measure of relative entropy between individual sequences with application to universal classification
- A new challenge for compression algorithms: Genetic sequences
- A universal algorithm for sequential data compression
- An Introduction to Symbolic Dynamics and Coding
- Analysis of symbolic sequences using the Jensen-Shannon divergence
- Clustering by Compression
- Data compression and learning in time sequences analysis
- Dynamical systems and computable information
- Entropy estimation of symbol sequences
- Ergodic theory of chaos and strange attractors
- Fifty years of Shannon theory
- Information distance
- Nonparametric entropy estimation for stationary processes and random fields, with applications to English text
- On Information and Sufficiency
- On the Complexity of Finite Sequences
- On the Length of Programs for Computing Finite Binary Sequences
- Predictability: a way to characterize complexity
- The Similarity Metric
- Using literal and grammatical statistics for authorship attribution
Cited in 3 documents