Local uncertainty sampling for large-scale multiclass logistic regression (Q2196246)

scientific article
English

    Statements

    Local uncertainty sampling for large-scale multiclass logistic regression (English)
    28 August 2020
    When a data set is too large for the available computing resources, a common approach to fitting a multiclass logistic regression is to subsample the data down to a size the resources can accommodate. Two types of class imbalance arise: marginal imbalance (MI), where some classes are much rarer than others, and conditional imbalance (CI), where the class labels are easy to predict for most observations. For marginally imbalanced binary classification, case-control (CC) subsampling draws an equal number of samples uniformly from each class.

    The authors first review an earlier subsampling scheme for binary logistic regression, local case-control (LCC) sampling, which is known to outperform uniform random sampling with respect to the asymptotic variance of the resulting estimator. They then propose a general subsampling scheme, local uncertainty sampling (LUS), for large-scale multiclass logistic regression: data points whose labels are conditionally uncertain given their observations, as measured by the predicted probability distribution, are preferentially selected, and a multiclass logistic model is then fitted to the subsample to estimate the model parameters. Experiments on simulated data and on real data sets (MNIST and Web-spam) confirm that LUS outperforms uniform sampling, CC sampling and LCC sampling under various settings. Theoretically, if the maximum likelihood estimator based on the full sample of size $n$ has asymptotic variance $v$, then for any constant $c > 1$ the LUS estimator achieves asymptotic variance at most $c v$ while using an expected subsample of size only $n/c$.
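    The sampling idea described above can be sketched in a few lines. This is only an illustrative sketch: the acceptance rule below is loosely modeled on the local case-control acceptance probability $|y - p(x)|$ for the binary case, and the function name `lus_subsample`, the pilot-probability matrix `P`, and the scaling factor `gamma` are assumptions for illustration, not the paper's exact construction.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def lus_subsample(P, y, gamma=0.5):
        """Uncertainty-based acceptance subsampling (illustrative sketch only).

        P     : (n, K) array of pilot-model class probabilities.
        y     : (n,) array of integer class labels in {0, ..., K-1}.
        gamma : factor in (0, 1] controlling the expected subsample size.

        Returns a boolean keep-mask and inverse-probability weights for the
        kept points, so that a weighted logistic fit remains consistent.
        """
        n = len(y)
        p_obs = P[np.arange(n), y]  # pilot probability of the observed label
        # Accept a point with probability increasing in its label uncertainty;
        # confidently predicted points are mostly dropped.
        accept = np.minimum(1.0, gamma * (1.0 - p_obs))
        keep = rng.random(n) < accept
        weights = 1.0 / np.maximum(accept[keep], 1e-12)
        return keep, weights
    ```

    The inverse-acceptance weights compensate for the biased selection: a point kept with small probability stands in for many similar, confidently predicted points that were dropped.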
    binary and multiclass logistic regression
    local case control sampling
    local uncertainty sampling
