Local uncertainty sampling for large-scale multiclass logistic regression (Q2196246)

From MaRDI portal
scientific article

    Statements

    Local uncertainty sampling for large-scale multiclass logistic regression (English)
    28 August 2020
    When a data set is too large for multiclass logistic regression to be fitted with the available computational resources, a common remedy is to subsample it down to a size those resources can accommodate. Class imbalance comes in two forms: marginal imbalance (MI), when some classes are much rarer than others, and conditional imbalance (CI), when the class labels are easy to predict for most of the observations. For binary classification under MI, case control (CC) subsampling draws an equal number of observations uniformly from each class. In this paper, the authors first review an earlier subsampling scheme for binary logistic regression, local case control (LCC) sampling, which is shown to outperform uniform random sampling in terms of the asymptotic variance of the resulting estimates. They then propose a general subsampling scheme for large-scale multiclass logistic regression, local uncertainty sampling (LUS). The method selects data points whose labels are conditionally uncertain given their local observations, as judged by the predicted probability distribution, and then fits a multiclass logistic model to the selected points to estimate the model parameter. Simulations and two real-world data sets, MNIST and Web-spam, confirm that LUS outperforms uniform sampling, CC sampling and LCC sampling under various settings. If the maximum likelihood estimate (MLE) based on the full sample of size $n$ has asymptotic variance $v$, then the LUS estimate has asymptotic variance less than $e v$ $(e>1)$ while being based on a sample of size $n/e$.
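The review does not state the acceptance rule used by LUS, so the following is only a minimal sketch of the simpler binary LCC idea it mentions, under stated assumptions. It assumes the commonly described LCC rule: keep an observation with probability |y - p_pilot(x)|, where p_pilot comes from a rough pilot fit, fit a logistic model on the kept points, and add the pilot coefficients back to the result. The function names (`fit_logistic`, `lcc_subsample_fit`), the simulated data, and the choice of a uniform subsample of the same data as the pilot are all hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Newton-Raphson MLE for binary logistic regression.
    X is expected to already contain an intercept column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        # A tiny ridge term keeps the Hessian invertible.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def lcc_subsample_fit(X, y, theta_pilot, rng):
    """Assumed LCC rule: accept (x, y) with probability |y - p_pilot(x)|,
    fit on the accepted points, then add the pilot coefficients back."""
    p_pilot = sigmoid(X @ theta_pilot)
    accept_prob = np.abs(y - p_pilot)      # large when the label is "surprising"
    keep = rng.random(len(y)) < accept_prob
    theta_sub = fit_logistic(X[keep], y[keep])
    return theta_sub + theta_pilot, keep

# Simulated, marginally imbalanced data (illustrative values only).
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([-3.0, 1.0, -1.0])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

# Pilot estimate from a small uniform subsample (an assumption of this sketch).
pilot_idx = rng.choice(n, size=2000, replace=False)
theta_pilot = fit_logistic(X[pilot_idx], y[pilot_idx])

theta_lcc, keep = lcc_subsample_fit(X, y, theta_pilot, rng)
print("fraction of data kept:", keep.mean())
print("LCC estimate:", theta_lcc)
```

In this sketch most easy-to-predict observations are discarded, which is the situation the review calls conditional imbalance. The multiclass LUS scheme generalizes the selection step, with an acceptance rule that is not reproduced here.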
    binary and multiclass logistic regression
    local case control sampling
    local uncertainty sampling
