Local case-control sampling: efficient subsampling in imbalanced data sets
Abstract: For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients $\theta^*$. By contrast, our estimator is consistent for $\theta^*$ provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to $1+\frac{1}{c}$ if we multiply the baseline acceptance probabilities by $c>1$ (and weight points with acceptance probability greater than 1), taking roughly $\frac{1+c}{2}$ times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
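The abstract describes the accept-reject scheme only in words, so the following is a minimal sketch rather than the authors' implementation. It assumes the usual formulation of local case-control sampling: with pilot coefficients `theta_pilot` and pilot probability p̃(x) = sigmoid(xᵀ theta_pilot), each point (x, y) is accepted with probability |y − p̃(x)|, an ordinary logistic regression is fit to the accepted points, and the post-hoc adjustment adds the pilot coefficients back to the subsample fit. The function names `fit_logistic` and `local_case_control`, the simulated data, and the choice to fit the pilot on a held-out uniform subsample are all illustrative assumptions, not details taken from this page.

```python
import numpy as np


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Unpenalized logistic-regression MLE via Newton-Raphson.

    X must already contain an intercept column if one is wanted.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)                          # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X   # observed information
        step = np.linalg.solve(hess, grad)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta


def local_case_control(X, y, theta_pilot, rng):
    """One pass of accept-reject subsampling plus the additive correction.

    Assumed acceptance rule: keep (x, y) with probability |y - p_pilot(x)|,
    so points whose label is surprising under the pilot are kept more often.
    """
    p_pilot = sigmoid(X @ theta_pilot)
    accept_prob = np.abs(y - p_pilot)
    keep = rng.random(len(y)) < accept_prob    # independent per-row decisions
    theta_sub = fit_logistic(X[keep], y[keep])
    # Assumed post-hoc adjustment: add the pilot back to the subsample fit.
    return theta_sub + theta_pilot, keep


# Illustrative simulated data with a rare positive class (hypothetical values).
rng = np.random.default_rng(0)
n = 200_000
theta_true = np.array([-4.0, 1.5])
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

# Pilot from a separate uniform subsample, so it is independent of the
# rows that are scanned by the accept-reject step.
n_pilot = 20_000
theta_pilot = fit_logistic(X[:n_pilot], y[:n_pilot])
theta_hat, keep = local_case_control(X[n_pilot:], y[n_pilot:], theta_pilot, rng)

print("pilot estimate:   ", theta_pilot)
print("corrected estimate:", theta_hat)
print("fraction of rows kept:", keep.mean())
```

Because each accept-reject decision depends only on a single row and the pilot, the scan can be split across workers and the accepted rows concatenated afterwards, which matches the "one parallelizable scan" remark in the abstract.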
Recommendations
- Local uncertainty sampling for large-scale multiclass logistic regression
- Surprise sampling: improving and extending the local case-control sampling
- Optimal subsampling for large sample logistic regression
- Optimal subsampling for softmax regression
- More efficient estimation for logistic regression with optimal subsamples
Cited in (32)
- Conditional characteristic feature screening for massive imbalanced data
- Matrix sketching for supervised classification with imbalanced classes
- Optimal subsampling for softmax regression
- Subdata selection algorithm for linear model discrimination
- Deterministic subsampling for logistic regression with massive data
- Optimal subsampling for large-scale quantile regression
- Semi-supervised inference for case-control binary data under possibly mis-specified logistic models
- Local uncertainty sampling for large-scale multiclass logistic regression
- Post-selection Inference of High-dimensional Logistic Regression Under Case–Control Design
- Fused variable screening for massive imbalanced data
- Optimal subsampling for large‐sample quantile regression with massive data
- Randomized maximum-contrast selection: subagging for large-scale regression
- Optimal subsampling for large sample logistic regression
- More efficient estimation for logistic regression with optimal subsamples
- Optimal Poisson subsampling for softmax regression
- A distance metric-based space-filling subsampling method for nonparametric models
- A two-stage optimal subsampling estimation for missing data problems with large-scale data
- Efficient posterior sampling for high-dimensional imbalanced logistic regression
- Surface temperature monitoring in liver procurement via functional variance change-point analysis
- Surprise sampling: improving and extending the local case-control sampling
- Semi-supervised inference for nonparametric logistic regression
- Likelihood Inference for Large Scale Stochastic Blockmodels With Covariates Based on a Divide-and-Conquer Parallelizable Algorithm With Communication
- A Subsampling Method for Regression Problems Based on Minimum Energy Criterion
- Subsampling in longitudinal models
- Unweighted estimation based on optimal sample under measurement constraints
- Efficient modelling of presence-only species data via local background sampling
- Model constraints independent optimal subsampling probabilities for softmax regression
- Estimating promotion effects in email marketing using a large-scale cross-classified Bayesian joint model for nested imbalanced data
- Rather “Good In, Good Out” Than “Garbage In, Garbage Out”: A Comparison of Various Discrete Subsampling Algorithms Using COVID-19 Data Without a Response Variable
- A semiparametric method for risk prediction using integrated electronic health record data
- A two-part measurement error model to estimate participation in undeclared work and related earnings
- A review on design inspired subsampling for big data