A fast and recursive algorithm for clustering large datasets with k-medians
From MaRDI portal
Publication:434902
Abstract: Clustering with fast algorithms large samples of high dimensional data is an important challenge in computational statistics. Borrowing ideas from MacQueen (1967) who introduced a sequential version of the -means algorithm, a new class of recursive stochastic gradient algorithms designed for the -medians loss criterion is proposed. By their recursive nature, these algorithms are very fast and are well adapted to deal with large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. A particular attention is paid to the averaged versions, which are known to have better performances, and a data-driven procedure that allows automatic selection of the value of the descent step is proposed. The performance of the averaged sequential estimator is compared on a simulation study, both in terms of computation speed and accuracy of the estimations, with more classical partitioning techniques such as -means, trimmed -means and PAM (partitioning around medoids). Finally, this new online clustering technique is illustrated on determining television audience profiles with a sample of more than 5000 individual television audiences measured every minute over a period of 24 hours.
Recommendations
- Near-optimal large-scale k-medoids clustering
- scientific article; zbMATH DE number 2090269
- Learning Theory
- Fast recursive and efficient algorithms for estimating the functional median and robust clustering in large dimension
- On coresets for k-means and k-median clustering
- A fast clustering algorithm for large-scale and high dimensional data
- Algorithms for \(k\)-median clustering over distributed streams
- A framework for statistical clustering with constant time approximation algorithms for \(K\)-median and \(K\)-means clustering
- \(k\)-median clustering under discrete Fréchet and Hausdorff distances
Cites work
- scientific article; zbMATH DE number 976356 (Why is no real title available?)
- scientific article; zbMATH DE number 3340881 (Why is no real title available?)
- A general trimming approach to robust cluster analysis
- A review of robust clustering methods
- Acceleration of Stochastic Approximation by Averaging
- Almost sure convergence of stochastic gradient processes with matrix step sizes
- Asymptotic Almost Sure Efficiency of Averaged Stochastic Algorithms
- Asymptotic behaviour of classification maximum likelihood estimates
- Data Clustering: Theory, Algorithms, and Applications
- Editorial: Machine learning and robust data mining
- Finding Groups in Data
- Hybrid hierarchical clustering with applications to microarray data
- Large-scale machine learning with stochastic gradient descent
- On a Geometric Notion of Quantiles for Multivariate Data
- Online wavelet-based density estimation for non-stationary streaming data
- Printer graphics for clustering
- Robustness Properties of k Means and Trimmed k Means
- Stochastic approximation for multivariate and functional median
- \(L_1\)-quantization and clustering in Banach spaces
Cited in
(16)- Metric \(k\)-median clustering in insertion-only streams
- Fast recursive and efficient algorithms for estimating the functional median and robust clustering in large dimension
- Multiobjective semisupervised learning with a right‐censored endpoint adapted to the multiple imputation framework
- Bayesian quadrature, energy minimization, and space-filling design
- Clustering transformed compositional data using \(K\)-means, with applications in gene expression and bicycle sharing system data
- Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm
- Fast estimation of the median covariation matrix with application to online robust principal components analysis
- Recursive estimators of integrated squared density derivatives
- Widening the scope of an eigenvector stochastic approximation process and application to streaming PCA and related methods
- Online stochastic Newton methods for estimating the geometric median and applications
- A compact law of the iterated logarithm for online estimator of hazard rate under random censoring
- Estimating the geometric median in Hilbert spaces with stochastic gradient algorithms: \(L^p\) and almost sure rates of convergence
- Online estimation of integrated squared density derivatives
- Streaming constrained binary logistic regression with online standardized data
- Online estimation of hazard rate under random censoring
- A quasi-Bayesian perspective to online clustering
This page was built for publication: A fast and recursive algorithm for clustering large datasets with \(k\)-medians
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q434902)