New hard-thresholding rules based on data splitting in high-dimensional imbalanced classification
Abstract: In binary classification, imbalance refers to situations in which one class is heavily under-represented. This issue arises either from the data collection process or because one class is indeed rare in the population. Imbalanced classification frequently arises in applications such as biology, medicine, engineering, and the social sciences. In this paper, for the first time, we theoretically study the impact of imbalanced class sizes on linear discriminant analysis (LDA) in high dimensions. We show that, due to data scarcity in one class, referred to as the minority class, and the high dimensionality of the feature space, LDA ignores the minority class, yielding a maximum misclassification rate for that class. We then propose a new construction of hard-thresholding rules based on a data-splitting technique that reduces the large difference between the misclassification rates of the two classes. We show that the proposed method is asymptotically optimal. We further study two well-known sparse versions of LDA in imbalanced cases. We evaluate the finite-sample performance of different methods using simulations and by analyzing two real data sets. The results show that our method either outperforms its competitors or has comparable performance based on a much smaller subset of selected features, while being computationally more efficient.
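The phenomenon described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the construction proposed in the paper: it uses a diagonal (independence-rule) variant of LDA, Gaussian data with identity covariance, a two-sample t-statistic for screening, and the threshold sqrt(2 log p). All function names, sample sizes, and the signal strength are hypothetical choices made for illustration. One half of the training data selects features by hard thresholding; the other half estimates the rule on the selected features only.

```python
import numpy as np

def pooled_var(x1, x2):
    # Pooled per-feature variance of two samples.
    n1, n2 = len(x1), len(x2)
    v1, v2 = x1.var(axis=0, ddof=1), x2.var(axis=0, ddof=1)
    return ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)

def fit_rule(x1, x2, keep=None):
    # Diagonal LDA (independence rule) restricted to the features in `keep`.
    if keep is None:
        keep = np.arange(x1.shape[1])
    m1, m2 = x1[:, keep].mean(axis=0), x2[:, keep].mean(axis=0)
    w = (m1 - m2) / pooled_var(x1[:, keep], x2[:, keep])
    return keep, w, (m1 + m2) / 2

def predict(rule, x):
    # Assign class 1 when the linear score is positive, class 2 otherwise.
    keep, w, mid = rule
    return np.where((x[:, keep] - mid) @ w > 0, 1, 2)

def select_features(x1, x2, threshold):
    # Hard thresholding: keep features whose |t-statistic| exceeds threshold.
    n1, n2 = len(x1), len(x2)
    se = np.sqrt(pooled_var(x1, x2) * (1 / n1 + 1 / n2))
    t = (x1.mean(axis=0) - x2.mean(axis=0)) / se
    return np.flatnonzero(np.abs(t) > threshold)

# Hypothetical setup: majority class 1 (n1 = 200), minority class 2 (n2 = 20),
# p = 5000 features, of which only the first 10 carry signal.
rng = np.random.default_rng(0)
p, n1, n2, n_signal = 5000, 200, 20, 10
shift = np.zeros(p)
shift[:n_signal] = 2.0
x1 = rng.standard_normal((n1, p))
x2 = rng.standard_normal((n2, p)) + shift
test1 = rng.standard_normal((300, p))
test2 = rng.standard_normal((300, p)) + shift

# Rule using all features and all training data.
full = fit_rule(x1, x2)

# Data splitting: first halves select features, second halves fit the rule.
keep = select_features(x1[: n1 // 2], x2[: n2 // 2], np.sqrt(2 * np.log(p)))
split = fit_rule(x1[n1 // 2 :], x2[n2 // 2 :], keep)

err_full_majority = np.mean(predict(full, test1) != 1)
err_full_minority = np.mean(predict(full, test2) != 2)
err_split_minority = np.mean(predict(split, test2) != 2)
print("features kept:", len(keep))
print("full rule, majority / minority error:", err_full_majority, err_full_minority)
print("split + threshold rule, minority error:", err_split_minority)
```

In this setup the rule built on all features assigns nearly every minority-class test point to the majority class, because the noise accumulated in the minority centroid over thousands of uninformative features dominates the signal. Restricting the rule to the few features that survive thresholding removes most of that noise, and using separate halves for selection and estimation keeps the two steps independent.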
Cites work
- scientific article; zbMATH DE number 5957445
- scientific article; zbMATH DE number 1753143
- A road to classification in high dimensional space: the regularized optimal affine discriminant
- Adaptive Weighted Learning for Unbalanced Multicategory Classification
- Bias-corrected diagonal discriminant rules for high-dimensional classification
- Covariance regularization by thresholding
- Distance-weighted support vector machine
- Effect of heavy tails on ultra high dimensional variable ranking methods
- Flexible high-dimensional classification machines and their asymptotic properties
- Fused variable screening for massive imbalanced data
- Geometric Representation of High Dimension, Low Sample Size Data
- Gradient boosting for high-dimensional prediction of rare events
- High dimensional classifiers in the imbalanced case
- High-dimensional classification using features annealed independence rules
- Multiclass linear discriminant analysis with ultrahigh-dimensional features
- Optimal variable selection in multi-group sparse discriminant analysis
- Penalized classification using Fisher's linear discriminant
- Regularized estimation of large covariance matrices
- Regularized linear discriminant analysis and its application in microarrays
- Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations
- Sparse Quadratic Discriminant Analysis For High Dimensional Data
- Sparse linear discriminant analysis by thresholding for high dimensional data
- Stability Selection
- Statistical fraud detection: a review
- Support vector machine and its bias correction in high-dimension, low-sample-size settings
- Sure independence screening for ultrahigh dimensional feature space. With discussion and authors' reply
- The design of polynomial function-based neural network predictors for detection of software defects
- The maximal data piling direction for discrimination
- Weighted distance weighted discrimination and its asymptotic properties
- \(p\)-values for high-dimensional regression
Cited in (5)
- Do unbalanced data have a negative effect on LDA?
- The effect of imbalanced data sets on LDA: a theoretical and empirical analysis
- Efficient posterior sampling for high-dimensional imbalanced logistic regression
- High dimensional classifiers in the imbalanced case
- Threshold optimization for classification in imbalanced data in a problem of gamma-ray astronomy