Optimal properties of centroid-based classifiers for very high-dimensional data
Publication:2380097
DOI: 10.1214/09-AOS736
zbMATH Open: 1183.62104
arXiv: 1002.4781
MaRDI QID: Q2380097
FDO: Q2380097
Authors: Tung H. Pham, Peter Hall
Publication date: 24 March 2010
Published in: The Annals of Statistics
Abstract: We show that scale-adjusted versions of the centroid-based classifier enjoy optimal properties when used to discriminate between two very high-dimensional populations where the principal differences are in location. The scale adjustment removes the tendency of scale differences to confound differences in means. Certain other distance-based methods, for example, those founded on nearest-neighbor distance, do not have optimal performance in the sense that we propose. Our results permit varying degrees of sparsity and signal strength to be treated, and require only mild conditions on dependence of vector components. Additionally, we permit the marginal distributions of vector components to vary extensively. In addition to providing theory, we explore numerical properties of a centroid-based classifier, and show that these features reflect theoretical accounts of performance.
Full work available at URL: https://arxiv.org/abs/1002.4781
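Illustrative sketch (not taken from the paper): the abstract describes a centroid-based classifier with a scale adjustment. One common way to read this is that a new observation z is assigned to population X when ||z - Ybar||^2 - ||z - Xbar||^2 is positive, and that the adjustment subtracts an estimate of tr(Sigma_Y)/n - tr(Sigma_X)/m, the bias that sampling noise in the two centroids adds to that difference. The Python below is a minimal sketch under that assumption. The function names (`mean_vector`, `total_variance`, `sq_dist`, `classify`) and the exact form of the adjustment are hypothetical and may differ from the authors' definition.

```python
def mean_vector(sample):
    """Component-wise mean (centroid) of a list of equal-length vectors."""
    n = len(sample)
    return [sum(col) / n for col in zip(*sample)]


def total_variance(sample, centroid):
    """Unbiased estimate of the trace of the covariance matrix."""
    n = len(sample)
    ss = sum((x - c) ** 2 for row in sample for x, c in zip(row, centroid))
    return ss / (n - 1)


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify(z, xs, ys, adjust=True):
    """Assign z to 'X' or 'Y' by comparing squared distances to the centroids.

    With adjust=True, an estimate of tr(Sigma_Y)/n - tr(Sigma_X)/m is
    subtracted so that unequal scales or sample sizes do not shift the
    decision boundary (assumed form of the scale adjustment).
    """
    xbar, ybar = mean_vector(xs), mean_vector(ys)
    stat = sq_dist(z, ybar) - sq_dist(z, xbar)
    if adjust:
        bias = total_variance(ys, ybar) / len(ys) - total_variance(xs, xbar) / len(xs)
        stat -= bias
    return "X" if stat > 0 else "Y"
```

For example, with training samples `xs = [[-1.0, 0.0], [1.0, 0.0]]` and `ys = [[3.0, 0.0], [3.0, 0.0]]`, the point `[1.6, 0.0]` lies slightly nearer the Y centroid, yet the X sample is far more variable; the adjusted rule accounts for that, so the two rules can disagree near the boundary.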
Recommendations
- Scale adjustments for classifiers in high-dimensional, low sample size settings
- A distance-based, misclassification rate adjusted classifier for multiclass, high-dimensional data
- Median-based classifiers for high-dimensional data
- Robust centroid based classification with minimum error rates for high dimension, low sample size data
- Quantile-based classifiers
Keywords: classification; high-dimensional data; sparsity; discrimination; centroid method; distance-based classifiers; location differences; minimax performance; scale adjustment
Cites Work
- The elements of statistical learning. Data mining, inference, and prediction
- Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data
- Consistent nonparametric regression. Discussion
- Nonlinear time series. Nonparametric and parametric methods
- Regularized estimation of large covariance matrices
- Pattern classification.
- Theoretical Measures of Relative Performance of Classifiers for High Dimensional Data with Small Sample Sizes
- Geometric Representation of High Dimension, Low Sample Size Data
- Bandwidth choice for nonparametric classification
- Scale adjustments for classifiers in high-dimensional, low sample size settings
- Long- and short-range correlations in genome organization
Cited In (5)