Feature screening in large scale cluster analysis
From MaRDI portal
Abstract: We propose a novel methodology for feature screening in clustering massive datasets, in which both the number of features and the number of observations can potentially be very large. Taking advantage of a fusion-penalization-based convex clustering criterion, we propose a very fast screening procedure that efficiently discards non-informative features: it first computes a clustering score corresponding to the clustering tree constructed for each feature, and then thresholds the resulting values. We provide theoretical support for our approach by establishing uniform non-asymptotic bounds on the clustering scores of the "noise" features. These bounds imply perfect screening of non-informative features with high probability, and are derived via careful analysis of the empirical processes corresponding to the clustering trees that the associated clustering procedure constructs for each feature. Through extensive simulation experiments we compare the performance of our proposed method with other screening approaches popular in cluster analysis, and obtain encouraging results. We demonstrate empirically that our method is applicable to cluster analysis of big datasets arising in single-cell gene expression studies.
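The screening idea — score each feature independently, then keep only features whose score clears a threshold — can be sketched as follows. This is a simplified illustration, not the paper's method: the paper derives its score from the convex clustering tree built per feature via \(l_1\) fusion penalization, whereas here a crude largest-relative-gap statistic stands in for the clustering score, and the threshold value is purely illustrative.

```python
import random


def clustering_score(values):
    """Toy per-feature clustering score: the largest gap between
    consecutive sorted values, relative to the feature's range.
    (A stand-in for the convex-clustering-tree score in the paper:
    a well-separated bimodal feature has one large gap, while a
    'noise' feature has many small, comparable gaps.)"""
    xs = sorted(values)
    spread = xs[-1] - xs[0]
    if spread == 0.0:
        return 0.0
    return max(b - a for a, b in zip(xs, xs[1:])) / spread


def screen_features(rows, threshold=0.3):
    """Score every column of the data matrix (list of rows) and
    keep the indices whose score exceeds the threshold."""
    p = len(rows[0])
    scores = [clustering_score([row[j] for row in rows]) for j in range(p)]
    kept = [j for j, s in enumerate(scores) if s > threshold]
    return kept, scores


if __name__ == "__main__":
    random.seed(0)
    n = 200
    # Column 0: informative, well-separated two-cluster feature.
    col0 = [random.gauss(0.0, 0.1) for _ in range(n // 2)] + \
           [random.gauss(10.0, 0.1) for _ in range(n // 2)]
    # Columns 1-4: non-informative uniform noise.
    noise = [[random.uniform(0.0, 1.0) for _ in range(4)] for _ in range(n)]
    rows = [[col0[i]] + noise[i] for i in range(n)]
    kept, scores = screen_features(rows)
    print(kept)  # only the informative feature survives screening
```

In this toy setup the informative column contains one dominant gap (between the two cluster centers), so its score is close to 1, while the noise columns' largest gaps are a small fraction of their range and fall below the threshold.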
Cites work
- scientific article; zbMATH DE number 3456274
- scientific article; zbMATH DE number 720689
- scientific article; zbMATH DE number 6122810
- A framework for feature selection in clustering
- A simple approach to sparse clustering
- Algorithm AS 136: A K-Means Clustering Algorithm
- An introduction to statistical learning. With applications in R
- Calibrating the Excess Mass and Dip Tests of Modality
- Clustering Objects on Subsets of Attributes (with Discussion)
- Convex clustering via \(l_1\) fusion penalization
- Detection and feature selection in sparse mixture models
- Estimation of a Convex Density Contour in Two Dimensions
- Grouping pursuit through a regularization solution surface
- Higher criticism for detecting sparse heterogeneous mixtures.
- Higher criticism thresholding: Optimal feature selection when useful features are rare and weak
- Hybrid hierarchical clustering with applications to microarray data
- Influential features PCA for high dimensional clustering
- On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions
- On consistency and sparsity for principal components analysis in high dimensions
- Optimal screening and discovery of sparse signals with applications to multistage high throughput studies
- Penalized model-based clustering
- Penalized model-based clustering with application to variable selection
- Phase transitions for high dimensional clustering and related problems
- Simultaneous Regression Shrinkage, Variable Selection, and Supervised Clustering of Predictors with OSCAR
- Simultaneous supervised clustering and feature selection over a graph
- Size, power and false discovery rates
- The dip test of unimodality
- The elements of statistical learning. Data mining, inference, and prediction
- Using Evidence of Mixed Populations to Select Variables for Clustering Very High-Dimensional Data
- Using specially designed exponential families for density estimation
- Variable Selection for Model-Based High-Dimensional Clustering and Its Application to Microarray Data
- Weak convergence and empirical processes. With applications to statistics