Local case-control sampling: efficient subsampling in imbalanced data sets

DOI: 10.1214/14-AOS1220; MaRDI QID: Q480957

Authors William Fithian, Trevor Hastie

Publication date 12 December 2014

Published in The Annals of Statistics

Full work available at URL https://arxiv.org/abs/1306.3706

zbMATH Keywords

logistic regression; subsampling; case-control sampling

Mathematics Subject Classification ID

Point estimation (62F10); Sampling theory, sample surveys (62D05)

Abstract: For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients $\theta^{*}$. By contrast, our estimator is consistent for $\theta^{*}$ provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to $1 + \frac{1}{c}$ if we multiply the baseline acceptance probabilities by $c > 1$ (and weight points with acceptance probability greater than 1), taking roughly $\frac{1+c}{2}$ times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
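The accept-reject scheme and the post-hoc adjustment are described above only in outline. The following is a minimal Python sketch of one way to read them, under assumptions not spelled out in the abstract: each point $(x, y)$ is accepted with probability $|y - \tilde{p}(x)|$, where $\tilde{p}$ is the pilot model's predicted probability, and the adjustment consists of adding the pilot coefficients back to the coefficients fitted on the subsample. The function names, the Newton solver, the simulated data, and the choice to fit the pilot on a held-out slice are all illustrative.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for extreme linear predictors.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_logistic(X, y, n_iter=50, ridge=1e-8):
    """Unweighted logistic-regression MLE via Newton's method.
    X is assumed to already contain an intercept column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        theta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return theta

def local_case_control(X, y, theta_pilot, rng):
    """One pass over the data: accept-reject, fit, then adjust.
    Assumed acceptance probability: |y - p_pilot(x)|."""
    p_pilot = sigmoid(X @ theta_pilot)
    accept_prob = np.abs(y - p_pilot)          # high when y is "surprising"
    keep = rng.random(len(y)) < accept_prob    # independent accept-reject draws
    theta_sub = fit_logistic(X[keep], y[keep])
    # Assumed post-hoc adjustment: add the pilot coefficients back.
    return theta_sub + theta_pilot, keep

# Illustrative simulation with a rare positive class (all values hypothetical).
rng = np.random.default_rng(0)
n, n_pilot = 200_000, 20_000
theta_true = np.array([-4.0, 1.0, -1.0])
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

# Pilot fitted on a held-out slice so it is independent of the main sample.
theta_pilot = fit_logistic(X[:n_pilot], y[:n_pilot])
theta_hat, keep = local_case_control(X[n_pilot:], y[n_pilot:], theta_pilot, rng)

print("fraction of points kept:", keep.mean())
print("adjusted estimate:", theta_hat)
```

In this sketch the acceptance step needs only the pilot coefficients and one row at a time, which is why a single parallelizable scan suffices. The variant with acceptance probabilities scaled by $c > 1$ would additionally require weighting points whose scaled probability exceeds 1, which is not shown here.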

