How Many Clusters? An Information-Theoretic Perspective
From MaRDI portal
Publication:4664508
Abstract: Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of large data sets in many fields. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based either on a framework in which clusters of a particular shape are assumed as a model of the system, or on a two-step procedure in which a clustering criterion determines the optimal assignments for a given number of clusters and a separate criterion measures the goodness of the classification to determine the number of clusters. In a statistical mechanics approach, clustering can be seen as a trade-off between energy-like and entropy-like terms, with lower temperature driving the proliferation of clusters to provide a more detailed description of the data. For finite data sets, we expect that there is a limit to the meaningful structure that can be resolved, and therefore a minimum temperature beyond which we begin to capture sampling noise. This suggests that correcting the clustering criterion for the bias which arises due to sampling errors will allow us to find a clustering solution at a temperature which is optimal in the sense that we capture maximal meaningful structure, without having to define an external criterion for the goodness or stability of the clustering. We show that, in a general information-theoretic framework, the finite size of a data set determines an optimal temperature, and we introduce a method for finding the maximal number of clusters which can be resolved from the data in the hard clustering limit.
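The energy/entropy trade-off described in the abstract can be illustrated with a minimal annealed soft-clustering sketch. This is not the paper's method; `soft_kmeans`, the deterministic initialisation, and the toy data are illustrative assumptions, showing only how temperature controls how many distinct clusters survive.

```python
import math

def soft_kmeans(data, k, temperature, iters=200):
    """Annealed soft k-means in 1D: point x is assigned to centre mu_c
    with Boltzmann weight exp(-(x - mu_c)^2 / T).  High T favours the
    entropy-like term, so centres merge; low T favours the energy-like
    (distortion) term, so centres proliferate and resolve finer structure."""
    # Deterministic initialisation (an assumption for this sketch):
    # spread the centres evenly over the data range.
    lo, hi = min(data), max(data)
    centers = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        weights = [0.0] * k
        for x in data:
            # Soft assignment probabilities p(c|x) from Boltzmann weights.
            w = [math.exp(-((x - c) ** 2) / temperature) for c in centers]
            z = sum(w)
            for c in range(k):
                sums[c] += (w[c] / z) * x
                weights[c] += w[c] / z
        # Re-estimate each centre as the weighted mean of its soft members.
        centers = [sums[c] / weights[c] if weights[c] > 0 else centers[c]
                   for c in range(k)]
    return sorted(centers)

# Two well-separated 1D groups, around 0 and around 10.
data = [0.0, 0.2, -0.1, 0.1, 9.9, 10.0, 10.1, 10.2]

cold = soft_kmeans(data, k=2, temperature=0.5)   # low T: both groups resolved
hot = soft_kmeans(data, k=2, temperature=500.0)  # high T: centres merge
```

At low temperature the two centres settle on the two group means, while at high temperature both collapse onto the global mean: the number of *distinct* clusters is set by the temperature, which is the knob whose finite-sample-optimal value the paper characterises.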
Cites work
- A Mathematical Theory of Communication
- Estimating the number of clusters in a data set via the gap statistic
- Model-Based Clustering, Discriminant Analysis, and Density Estimation
- On stochastic complexity and nonparametric density estimation
- Stability-Based Validation of Clustering Solutions
- Statistical Inference, Occam's Razor, and Statistical Mechanics on the Space of Probability Distributions
Cited in (12)
- Enhancing the efficiency of a decision support system through the clustering of complex rule-based knowledge bases and modification of the inference algorithm
- A novel dynamic minimum spanning tree based clustering method for image mining
- Resampling approach for cluster model selection
- Algorithms of maximum likelihood data clustering with applications
- Replica analysis of Bayesian data clustering
- Clustering: how much bias do we need?
- Finding the Number of Clusters in a Dataset
- Optimal causal inference: estimating stored information and approximating causal architecture
- A Robust Information Clustering Algorithm
- Information Theoretical Clustering Is Hard to Approximate
- The information bottleneck and geometric clustering
- The deterministic information bottleneck