Feature screening in large scale cluster analysis
From MaRDI portal
Abstract: We propose a novel methodology for feature screening in clustering massive datasets, in which both the number of features and the number of observations can potentially be very large. Taking advantage of a fusion-penalization-based convex clustering criterion, we propose a very fast screening procedure that efficiently discards non-informative features: it first computes a clustering score corresponding to the clustering tree constructed for each feature, and then thresholds the resulting values. We provide theoretical support for our approach by establishing uniform non-asymptotic bounds on the clustering scores of the "noise" features. These bounds imply perfect screening of non-informative features with high probability, and are derived via careful analysis of the empirical processes corresponding to the clustering trees that the associated clustering procedure constructs for each feature. Through extensive simulation experiments we compare the performance of our proposed method with other screening approaches popular in cluster analysis, and obtain encouraging results. We demonstrate empirically that our method is applicable to cluster analysis of big datasets arising in single-cell gene expression studies.
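The screening idea — score each feature independently, then keep only features whose score clears a threshold — can be sketched as follows. This is a simplified illustration, not the paper's method: the paper derives its score from the convex clustering tree built per feature via \(l_1\) fusion penalization, whereas here a crude largest-relative-gap statistic stands in for the clustering score, and the threshold value is purely illustrative.

```python
import random


def clustering_score(values):
    """Toy per-feature clustering score: the largest gap between
    consecutive sorted values, relative to the feature's range.
    (A stand-in for the convex-clustering-tree score in the paper:
    a well-separated bimodal feature has one large gap, while a
    'noise' feature has many small, comparable gaps.)"""
    xs = sorted(values)
    spread = xs[-1] - xs[0]
    if spread == 0.0:
        return 0.0
    return max(b - a for a, b in zip(xs, xs[1:])) / spread


def screen_features(rows, threshold=0.3):
    """Score every column of the data matrix (list of rows) and
    keep the indices whose score exceeds the threshold."""
    p = len(rows[0])
    scores = [clustering_score([row[j] for row in rows]) for j in range(p)]
    kept = [j for j, s in enumerate(scores) if s > threshold]
    return kept, scores


if __name__ == "__main__":
    random.seed(0)
    n = 200
    # Column 0: informative, well-separated two-cluster feature.
    col0 = [random.gauss(0.0, 0.1) for _ in range(n // 2)] + \
           [random.gauss(10.0, 0.1) for _ in range(n // 2)]
    # Columns 1-4: non-informative uniform noise.
    noise = [[random.uniform(0.0, 1.0) for _ in range(4)] for _ in range(n)]
    rows = [[col0[i]] + noise[i] for i in range(n)]
    kept, scores = screen_features(rows)
    print(kept)  # only the informative feature survives screening
```

In this toy setup the informative column contains one dominant gap (between the two cluster centers), so its score is close to 1, while the noise columns' largest gaps are a small fraction of their range and fall below the threshold.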
Cites work
- scientific article; zbMATH DE number 3456274
- scientific article; zbMATH DE number 720689
- scientific article; zbMATH DE number 6122810
- A framework for feature selection in clustering
- A simple approach to sparse clustering
- Algorithm AS 136: A K-Means Clustering Algorithm
- An introduction to statistical learning. With applications in R
- Calibrating the Excess Mass and Dip Tests of Modality
- Clustering Objects on Subsets of Attributes (with Discussion)
- Convex clustering via \(l_1\) fusion penalization
- Detection and feature selection in sparse mixture models
- Estimation of a Convex Density Contour in Two Dimensions
- Grouping pursuit through a regularization solution surface
- Higher criticism for detecting sparse heterogeneous mixtures.
- Higher criticism thresholding: Optimal feature selection when useful features are rare and weak
- Hybrid hierarchical clustering with applications to microarray data
- Influential features PCA for high dimensional clustering
- On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions
- On consistency and sparsity for principal components analysis in high dimensions
- Optimal screening and discovery of sparse signals with applications to multistage high throughput studies
- Penalized model-based clustering
- Penalized model-based clustering with application to variable selection
- Phase transitions for high dimensional clustering and related problems
- Simultaneous Regression Shrinkage, Variable Selection, and Supervised Clustering of Predictors with OSCAR
- Simultaneous supervised clustering and feature selection over a graph
- Size, power and false discovery rates
- The dip test of unimodality
- The elements of statistical learning. Data mining, inference, and prediction
- Using Evidence of Mixed Populations to Select Variables for Clustering Very High-Dimensional Data
- Using specially designed exponential families for density estimation
- Variable Selection for Model-Based High-Dimensional Clustering and Its Application to Microarray Data
- Weak convergence and empirical processes. With applications to statistics