Optimal time bounds for approximate clustering
From MaRDI portal
Publication:703075
DOI10.1023/B:MACH.0000033114.18632.E0zbMATH Open1069.68091arXiv1301.0587MaRDI QIDQ703075FDOQ703075
Authors: Ramgopal R. Mettu, C. Greg Plaxton
Publication date: 19 January 2005
Published in: Machine Learning (Search for Journal in Brave)
Abstract: Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect the emph{k-median} objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique that we call emph{successive sampling} that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(klog{n/k})) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Omega(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say 1/100) probability. Thus we establish a tight time bound of Theta(nk) for the k-median problem for a wide range of values of k. The best previous upper bound for the problem was O(nk), where the O-notation hides polylogarithmic factors in n and k. The best previous lower bound of O(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.
Full work available at URL: https://arxiv.org/abs/1301.0587
Recommendations
- A \(k\)-median algorithm with running time independent of data size
- Learning Theory
- A framework for statistical clustering with constant time approximation algorithms for \(K\)-median and \(K\)-means clustering
- Automata, Languages and Programming
- Sublinear‐time approximation algorithms for clustering via random sampling
Cited In (29)
- Sublinear‐time approximation algorithms for clustering via random sampling
- Title not available (Why is that?)
- Sublinear-time Algorithms
- Small space representations for metric min-sum \(k\)-clustering and their applications
- Title not available (Why is that?)
- Clustering mixtures with almost optimal separation in polynomial time
- A \(k\)-median algorithm with running time independent of data size
- Attainable accuracy guarantee for the \(k\)-medians clustering in [0, 1]
- Deterministic metric 1-median selection with very few queries
- Probabilistic \(k\)-median clustering in data streams
- Linear-time approximation schemes for clustering problems in any dimensions
- Speeding up the EM algorithm for mixture model-based segmentation of magnetic resonance images
- A local search approximation algorithm for \(k\)-means clustering
- A sublinear-time approximation scheme for bin packing
- A streaming algorithm for \(k\)-means with approximate coreset
- Sublinear time approximate clustering
- Approximation algorithms for hierarchical location problems
- A lower bound for metric 1-median selection
- Near-optimal large-scale k-medoids clustering
- Learning Theory
- Near-optimal clustering in the \(k\)-machine model
- Sublinear time approximation of the cost of a metric \(k\)-nearest neighbor graph
- Title not available (Why is that?)
- A framework for statistical clustering with constant time approximation algorithms for \(K\)-median and \(K\)-means clustering
- Automata, Languages and Programming
- Approximate clustering without the approximation
- A FAST k-MEANS IMPLEMENTATION USING CORESETS
- On Las Vegas approximations for metric 1-median selection
- Metric 1-median selection: query complexity vs. approximation ratio
This page was built for publication: Optimal time bounds for approximate clustering
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q703075)