Local uncertainty sampling for large-scale multiclass logistic regression (Q2196246)
From MaRDI portal
Latest revision as of 08:56, 30 July 2024
scientific article
Language | Label | Description | Also known as |
---|---|---|---|
English | Local uncertainty sampling for large-scale multiclass logistic regression | scientific article | |
Statements
Local uncertainty sampling for large-scale multiclass logistic regression (English)
28 August 2020
When huge data sets must be analyzed by multiclass logistic regression and adequate computational facilities are not available, a commonly used approach is to subsample the data so that it can be accommodated within the available computer resources. Two types of class imbalance are distinguished: marginal imbalance (MI), when some classes are rarer than others, and conditional imbalance (CI), when the class labels are easy to predict for most of the observations. For binary classification under MI, case-control (CC) subsampling draws an equal number of samples uniformly from each class. In this paper, the authors first review an earlier subsampling scheme for binary logistic regression, termed local case-control (LCC) sampling, which is shown to fare better than uniform random sampling with respect to the asymptotic variance of the resulting estimates. They then propose a general subsampling scheme for large-scale multiclass logistic regression problems, local uncertainty sampling (LUS). The method selects data points whose labels are conditionally uncertain given their local observations, as judged by the predicted probability distribution, and then fits a multiclass logistic model to the subsample to estimate the model parameter. Simulated and real-world data sets, namely MNIST and Web-spam, are considered, and it is confirmed that LUS fares better than uniform sampling, CC sampling and LCC sampling under various settings. If the MLE based on the full sample of size $n$ has asymptotic variance $v$, then the LUS estimator has asymptotic variance less than $e v$ $(e>1)$ while being based on a sample of size $n/e$.
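The selection step described in the review can be illustrated with a minimal Python sketch. It is not the acceptance rule from the paper: it assumes, purely for illustration, that a pilot model supplies predicted class probabilities and that an observation is kept with probability one minus the predicted probability of its observed label. In the binary case this reduces to the LCC acceptance probability $|y - \tilde p(x)|$. The names `acceptance_probabilities`, `uncertainty_subsample`, `pilot_probs` and `seed` are hypothetical, and the model-fitting and correction steps are omitted.

```python
import random


def acceptance_probabilities(pilot_probs, labels):
    """Return 1 - p_tilde[y_i] for each observation.

    Assumption (not from the paper): a point is kept with probability equal
    to one minus the pilot probability assigned to its observed label.
    """
    return [1.0 - probs[y] for probs, y in zip(pilot_probs, labels)]


def uncertainty_subsample(pilot_probs, labels, seed=0):
    """Return the indices of the observations retained in the subsample."""
    rng = random.Random(seed)  # seeded generator for reproducibility
    accept = acceptance_probabilities(pilot_probs, labels)
    # rng.random() lies in [0, 1), so a point with acceptance
    # probability 0 is never kept.
    return [i for i, a in enumerate(accept) if rng.random() < a]


# Toy example with three classes; each row is a hypothetical pilot prediction.
pilot_probs = [
    [0.98, 0.01, 0.01],  # confidently and correctly predicted: rarely kept
    [0.40, 0.35, 0.25],  # ambiguous point: kept fairly often
    [0.05, 0.90, 0.05],  # observed label contradicts the pilot: usually kept
    [1.00, 0.00, 0.00],  # observed label has pilot probability 1: never kept
]
labels = [0, 1, 0, 0]

accept = acceptance_probabilities(pilot_probs, labels)
kept = uncertainty_subsample(pilot_probs, labels, seed=0)
```

Under conditional imbalance most observations resemble the first and last rows, so a rule of this kind discards most of the data while keeping the points whose labels are hard to predict.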
binary and multiclass logistic regression
local case control sampling
local uncertainty sampling