Adjusted Pearson chi-square feature screening for multi-classification with ultrahigh dimensional data (Q1683647)

The statistical setting is that of \(n\) independent observations of a categorical response \(Y\) and of \(p\) covariates \(X_1,\dots,X_p\). Let \(S\) be the subset of \(\{1,\dots,p\}\) containing the indices of the covariates that are statistically related to the response. The aim of the paper is to develop an estimator \(\hat{S}\) with the sure screening property, meaning that \(S\) is a subset of \(\hat{S}\) with high probability. The paper considers both categorical and continuous covariates; the proposed method handles continuous covariates by categorizing them at their sample percentiles. The estimator \(\hat{S}\) is constructed by ranking the covariates according to an adjusted Pearson chi-square statistic computed separately between the response and each covariate. Here ``adjusted'' means that the classical Pearson chi-square statistic is normalized by the logarithm of the number of levels of the covariate. Lower bounds on the probability \(P(S \subseteq \hat{S})\), which entail the sure screening property, are stated and proved. These bounds extend previous research in two ways. Firstly, the paper explicitly accounts for both categorical and continuous covariates. Secondly, the bounds are stated explicitly in terms of asymptotic rates for the number of levels \(R\) of the response, the number of levels \(J_k\) of the \(k\)th covariate (possibly after categorization of a continuous covariate), the number of observations \(n\), and the number of covariates \(p\). Allowing the numbers of levels \(R\) and \(J_k\) to increase with the number of observations \(n\) is referred to as diverging classes in the paper. The paper also contains simulation studies that support the theoretical results and illustrate the small-sample behaviour of the proposed method. Finally, the paper contains a real data analysis with \(n=497\) and \(p=1125\). This exemplifies the main application of the proposed methodology, namely the case where \(p\) is larger than \(n\).
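
The screening procedure described above can be illustrated by a minimal Python sketch. It follows only the description in this review: continuous covariates are cut at sample percentiles, the classical Pearson chi-square statistic is computed for each covariate against the response, and the statistic is divided by the logarithm of the number of levels before ranking. The function names (`categorize`, `pearson_chi2`, `adjusted_chi2`, `screen`), the choice of the number of levels, and the rule of keeping the top `d` covariates are assumptions made for illustration; the exact normalization and thresholding used in the paper may differ.

```python
import numpy as np


def categorize(x, n_levels):
    """Cut a continuous covariate at its sample percentiles into n_levels classes.

    The number of levels is a user choice here (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float).ravel()
    probs = np.linspace(0.0, 1.0, n_levels + 1)[1:-1]  # interior percentiles
    cuts = np.quantile(x, probs)
    return np.searchsorted(cuts, x, side="right")  # labels 0, ..., n_levels - 1


def pearson_chi2(x, y):
    """Classical Pearson chi-square statistic of the x-by-y contingency table."""
    _, xi = np.unique(np.asarray(x).ravel(), return_inverse=True)
    _, yi = np.unique(np.asarray(y).ravel(), return_inverse=True)
    table = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(table, (xi, yi), 1.0)  # observed cell counts
    # Expected counts under independence; all margins are positive because
    # only observed levels enter the table.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())


def adjusted_chi2(x, y):
    """Pearson chi-square divided by log(number of observed covariate levels)."""
    n_levels = len(np.unique(x))
    if n_levels < 2:
        return 0.0  # a constant covariate carries no information
    return pearson_chi2(x, y) / np.log(n_levels)


def screen(X, y, d):
    """Rank the columns of X by the adjusted statistic and keep the top d.

    X must already be categorical (apply `categorize` to continuous columns).
    Returns the selected column indices and the vector of all scores.
    """
    scores = np.array([adjusted_chi2(X[:, k], y) for k in range(X.shape[1])])
    selected = np.argsort(-scores)[:d]
    return selected, scores
```

In this sketch the division by \(\log J_k\) serves to make covariates with different numbers of levels comparable, since the unadjusted statistic tends to grow with the size of the contingency table. A continuous column would first be passed through `categorize` and then screened together with the categorical columns.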

reviewed by

Bo Markussen

zbMATH Keywords

variable selection

continuous and categorical covariates

diverging classes

Pearson chi-square statistics