Adjusted Pearson chi-square feature screening for multi-classification with ultrahigh dimensional data (Q1683647)

From MaRDI portal
scientific article

    Statements

    Adjusted Pearson chi-square feature screening for multi-classification with ultrahigh dimensional data (English)
    1 December 2017
    The statistical setting is that of \(n\) independent observations of a categorical response \(Y\) and of \(p\) covariates \(X_1,\dots,X_p\). Let \(S\) be the subset of \(\{1,\dots,p\}\) containing the indices of the covariates that are statistically related to the response. The aim of the paper is to develop an estimator \(\hat{S}\) with the sure screening property, meaning that \(S\) is a subset of \(\hat{S}\) with high probability. The paper considers both categorical and continuous covariates; continuous covariates are handled by categorizing them at their sample percentiles. The estimator \(\hat{S}\) is constructed by ranking the covariates according to an adjusted Pearson chi-square statistic computed separately between the response and each covariate. Here "adjusted" means that the classical Pearson chi-square statistic is normalized by the logarithm of the number of levels of the covariate. Lower bounds on the probability \(P(S \subseteq \hat{S})\) that entail the sure screening property are stated and proved. These bounds extend previous research in several ways. Firstly, the paper explicitly accounts for both categorical and continuous covariates. Secondly, the bounds are stated explicitly in terms of asymptotic rates for the number of levels \(R\) of the response, the number of levels \(J_k\) of the \(k\)th covariate (possibly after categorization of a continuous covariate), the number of observations \(n\), and the number of covariates \(p\). The paper refers to the case where \(R\) and \(J_k\) are allowed to increase with \(n\) as diverging classes. The paper also contains simulation studies that support the theoretical results and illustrate the small-sample behaviour of the proposed method. Finally, it presents a real data analysis with \(n=497\) and \(p=1125\), which illustrates the main application of the methodology, namely the case where \(p\) is larger than \(n\).
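The screening procedure described in the review can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`categorize`, `pearson_chi2`, `adjusted_chi2`, `screen`), the choice of equally spaced sample quantiles as cut points, the number of bins `n_bins`, and the number `d` of retained covariates are all hypothetical. The exact scaling of the statistic in the paper may also differ; only the division by the logarithm of the number of covariate levels is taken from the review. For simplicity the sketch applies the same treatment to every column, so mixed categorical and continuous data would need a per-column switch.

```python
import numpy as np


def categorize(x, n_bins):
    """Slice a continuous covariate at equally spaced sample quantiles (assumed scheme)."""
    cuts = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(cuts, x, side="right")


def pearson_chi2(x, y):
    """Classical Pearson chi-square statistic of the contingency table of x and y."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    xi, yi = np.ravel(xi), np.ravel(yi)
    table = np.zeros((yi.max() + 1, xi.max() + 1))
    np.add.at(table, (yi, xi), 1)  # observed counts
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())


def adjusted_chi2(x, y):
    """Pearson chi-square divided by log(number of levels of the covariate)."""
    levels = np.unique(x).size
    if levels < 2:  # a constant covariate carries no information
        return 0.0
    return pearson_chi2(x, y) / np.log(levels)


def screen(X, y, d, n_bins=None):
    """Return the indices of the d top-ranked covariates and all scores.

    If n_bins is given, every column is treated as continuous and categorized
    first; otherwise every column is treated as categorical.
    """
    scores = np.empty(X.shape[1])
    for k in range(X.shape[1]):
        col = X[:, k] if n_bins is None else categorize(X[:, k], n_bins)
        scores[k] = adjusted_chi2(col, y)
    return np.argsort(-scores)[:d], scores


# Toy usage: 200 observations, 50 continuous covariates, only column 3 is relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = (X[:, 3] > 0).astype(int)
selected, scores = screen(X, y, d=5, n_bins=4)
print(selected)
```

In this toy setting the estimate \(\hat{S}\) corresponds to `selected`, and the sure screening property corresponds to the relevant index being contained in it. Dividing by the logarithm of the number of levels keeps covariates with many categories from being favoured merely because their chi-square statistics tend to be larger.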
    variable selection
    continuous and categorical covariates
    diverging classes
    Pearson chi-square statistics
    sure screening property

    Identifiers