Dimensionality reduction for \(k\)-means clustering and low rank approximation
Publication:2941504
Abstract: We show how to approximate a data matrix \(A\) with a much smaller sketch \(\tilde{A}\) that can be used to solve a general class of constrained \(k\)-rank approximation problems to within \((1+\epsilon)\) error. Importantly, this class of problems includes \(k\)-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just \(O(k)\) dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For \(k\)-means dimensionality reduction, we provide \((1+\epsilon)\) relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only `cover' a good subspace for \(A\), but can be used directly to compute this subspace. Finally, for \(k\)-means clustering, we show how to achieve a \((9+\epsilon)\) approximation by Johnson-Lindenstrauss projecting data points to just \(O(\log k/\epsilon^2)\) dimensions. This gives the first result that leverages the specific structure of \(k\)-means to achieve dimension independent of input size and sublinear in \(k\).
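As a rough illustration of the final result in the abstract (not the authors' code), the Python sketch below Johnson-Lindenstrauss-projects the rows of a data matrix to \(O(\log k/\epsilon^2)\) dimensions and clusters the sketch. The function name `jl_sketch`, the constant 1 in the target dimension, and the use of scikit-learn's `KMeans` (a Lloyd's-style heuristic, whereas the \((9+\epsilon)\) guarantee assumes the sketch is clustered near-optimally) are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in solver; any k-means algorithm can run on the sketch

def jl_sketch(A, k, eps=0.5, seed=None):
    """Project the rows of A to O(log(k) / eps^2) dimensions with a dense
    Gaussian Johnson-Lindenstrauss map (illustrative constant of 1)."""
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    t = max(1, int(np.ceil(np.log(k) / eps ** 2)))  # illustrative target dimension
    S = rng.standard_normal((d, t)) / np.sqrt(t)    # scaled Gaussian projection
    return A @ S

# Usage: cluster in the sketched space and reuse the label assignment for
# the original points; per the abstract, this yields a (9 + eps)-approximate
# k-means clustering when the sketch itself is clustered (near-)optimally.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 200))  # 1000 points in 200 dimensions
k = 10
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
    jl_sketch(A, k, eps=0.5, seed=1)
)
```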
Recommendations
- Turning big data into tiny data: constant-size coresets for \(k\)-means, PCA and projective clustering
- Oblivious dimension reduction for \(k\)-means: beyond subspaces and the Johnson-Lindenstrauss lemma
- Performance of Johnson-Lindenstrauss transform for \(k\)-means and \(k\)-medians clustering
Cited in (38)
- An efficient \(K\)-means clustering algorithm for tall data
- Pass-efficient randomized algorithms for low-rank matrix approximation using any number of views
- scientific article (zbMATH DE number 7049775)
- scientific article (zbMATH DE number 7758314)
- On coresets for fair clustering in metric and Euclidean spaces and their applications
- Performance of Johnson-Lindenstrauss transform for \(k\)-means and \(k\)-medians clustering
- Oblivious dimension reduction for \(k\)-means: beyond subspaces and the Johnson-Lindenstrauss lemma
- Turning big data into tiny data: constant-size coresets for \(k\)-means, PCA and projective clustering
- Low rank approximation of binary matrices: column subset selection and generalizations
- Scalable kernel \(k\)-means clustering with Nyström approximation: relative-error bounds
- Side-constrained minimum sum-of-squares clustering: mathematical programming and random projections
- Approximating spectral clustering via sampling: a review
- Randomized Dimensionality Reduction for \(k\)-Means Clustering
- A Doubly Enhanced EM Algorithm for Model-Based Tensor Clustering
- Efficient binary embedding of categorical data using BinSketch
- Random Projection and Recovery for High Dimensional Optimization with Arbitrary Outliers
- Parameterized \(k\)-clustering: tractability island
- \texttt{pylspack}: parallel algorithms and data structures for sketching, column subset selection, regression, and leverage scores
- Frequent directions: simple and deterministic matrix sketching
- Sketching for principal component regression
- Online row sampling
- Robust communication-optimal distributed clustering algorithms
- Practical sketching algorithms for low-rank matrix approximation
- Infinite lattice learner: an ensemble for incremental learning
- Reduced-Rank Modeling for High-Dimensional Model-Based Clustering
- scientific article (zbMATH DE number 6850339)
- An Improved Analysis and Unified Perspective on Deterministic and Randomized Low-Rank Matrix Approximation
- Reduced \(k\)-means clustering with MCA in a low-dimensional space
- On coresets for support vector machines
- Random projections for Bayesian regression
- SketchySGD: reliable stochastic optimization via randomized curvature estimates
- On using Toeplitz and circulant matrices for Johnson-Lindenstrauss transforms
- Streaming low-rank matrix approximation with an application to scientific simulation
- Structural conditions for projection-cost preservation via randomized matrix multiplication
- Exemplar-based low-rank matrix decomposition for data clustering
- scientific article (zbMATH DE number 7164768)
- Core-sets: updated survey
- A novel method for optimizing spectral rotation embedding \(K\)-means with coordinate descent