Are discoveries spurious? Distributions of maximum spurious correlations and their applications (Q1650067)

The paper under review studies whether discoveries made by data mining approaches may be spurious owing to high dimensionality and limited sample size. It uses the multiplier bootstrap procedure to approximate the unknown distributions of maximum spurious correlations and establishes the consistency of this approach. Various statistical and machine learning methods and algorithms have been proposed to find a small group of covariates associated with responses such as biological and clinical outcomes. \textit{R. Tibshirani} [J. R. Stat. Soc., Ser. B 58, No. 1, 267--288 (1996; Zbl 0850.62538)] and \textit{J. Fan} and \textit{R. Li} [J. Am. Stat. Assoc. 96, No. 456, 1348--1360 (2001; Zbl 1073.62547)] introduced the LASSO and SCAD, respectively, under an exogeneity assumption (all covariates are uncorrelated with the residual of the true model). \textit{J. Fan} and \textit{Y. Liao} [Ann. Stat. 42, No. 3, 872--917 (2014; Zbl 1305.62113)] and \textit{J. Fan} et al. [Ann. Stat. 42, No. 3, 819--849 (2014; Zbl 1305.62252)] gave evidence that such an ideal assumption might not hold, although it is necessary for model selection consistency. Whether data mining techniques produce results that are better than spurious correlation depends not only on the correlation between the fitted and observed values, but also on the sample size, the number of variables selected and the total number of variables. Let \(X\) be a \(p\)-dimensional random vector of covariates and \(X_S\) be the subset of covariates indexed by \(S\). Let \(\widehat{\mathrm{corr}}_n (\varepsilon, \alpha_S^T X_S)\) be the sample correlation between the random noise \(\varepsilon\) (independent of \(X\)) and \(\alpha_S^T X_S\) based on a sample of size \(n\), where \(\alpha_S\) is a constant vector.
The maximum spurious correlation is defined as \[ \widehat{R}_n (s, p) = \max_{|S|=s}\max_{\alpha_S} \widehat{\mathrm{corr}}_n (\varepsilon, \alpha_S^T X_S), \] when \(X\) and \(\varepsilon\) are independent, where the maximum is taken over all \({p\choose s}\) subsets of size \(s\) and all linear combinations of the selected \(s\) covariates. For example, testing the null hypothesis \(E(\varepsilon X_j)=0\), \(j=1, \dots, p\), amounts to comparing the observed maximum correlation with the distribution of \(\widehat{R}_n (1, p)\). \textit{J. Fan} et al. [J. R. Stat. Soc., Ser. B, Stat. Methodol. 74, No. 1, 37--65 (2012)] conducted simulations demonstrating that the spurious correlation can be very high when \(p\) is large and that it grows fast with \(s\). Deriving the asymptotic distribution of the statistic \(\widehat{R}_n (s, p)\) poses several challenges. The paper under review uses the multiplier bootstrap method and demonstrates its consistency under mild conditions, where the theoretical validity is guaranteed by the multiplier central limit theorem of \textit{A. van der Vaart} and \textit{J. A. Wellner} [Weak convergence and empirical processes. With applications to statistics. New York, NY: Springer (1996; Zbl 0862.60002)]. Section 2 introduces the concepts of spurious correlation and the moment conditions used in the asymptotic distribution analysis of \(\widehat{R}_n (s, p)\). The sampling assumption is that \(\{\varepsilon_i\}_{i=1}^n\) and \(\{X_i\}_{i=1}^n\) are independent random samples from the distributions of \(\varepsilon\) and \(X\). The \(s\)-sparse minimal and maximal eigenvalues of the covariance matrix \(\Sigma\) are denoted by \(\phi_{\mathrm{min}}(s)\) and \(\phi_{\mathrm{max}}(s)\), respectively. The \(s\)-sparse condition number of \(\Sigma\) is defined to be \(\gamma_s = \sqrt{\frac{\phi_{\mathrm{max}}(s)}{\phi_{\mathrm{min}}(s)}}\).
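To make the definition of \(\widehat{R}_n (s, p)\) concrete, the following minimal Python sketch evaluates it by exhaustive search, using the fact that for a fixed subset \(S\) the inner maximum over \(\alpha_S\) is the sample multiple correlation coefficient from a least-squares fit of \(\varepsilon\) on \(X_S\). The function name `max_spurious_correlation`, the Gaussian design, and the sizes \(n=50\), \(p=30\) are illustrative assumptions, not taken from the paper; exhaustive search is only feasible for small \(s\) and \(p\).

```python
import itertools
import numpy as np

def max_spurious_correlation(X, eps, s):
    """Exhaustive-search value of R_n(s, p) (hypothetical helper).

    For each subset S of s columns, the maximum over alpha_S of the sample
    correlation between eps and X_S @ alpha_S equals the multiple correlation
    coefficient of the least-squares fit of eps on X_S.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)          # centre covariates
    ec = eps - eps.mean()            # centre noise
    tss = ec @ ec                    # total sum of squares of the noise
    best = 0.0
    for S in itertools.combinations(range(p), s):
        XS = Xc[:, list(S)]
        coef, *_ = np.linalg.lstsq(XS, ec, rcond=None)
        fitted = XS @ coef
        r = np.sqrt((fitted @ fitted) / tss)   # multiple correlation
        best = max(best, r)
    return best

# Illustrative setting (assumed): noise independent of all covariates.
rng = np.random.default_rng(0)
n = 50
eps = rng.standard_normal(n)
X = rng.standard_normal((n, 30))

r1_small = max_spurious_correlation(X[:, :5], eps, 1)  # s = 1, p = 5
r1_large = max_spurious_correlation(X, eps, 1)         # s = 1, p = 30
r2_large = max_spurious_correlation(X, eps, 2)         # s = 2, p = 30
print(r1_small, r1_large, r2_large)
```

By construction the three values are nondecreasing, since enlarging the pool of covariates or the subset size can only increase the maximum; this is the growth in \(p\) and \(s\) that the review refers to.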
Section 3 first states Theorem 3.1, a Berry-Esseen-type bound that depends explicitly on the triple \((s, p, n)\) and, under the assumptions of Section 2, controls \(\sup_{t\geq 0} |P[\sqrt{n}\widehat{R}_n (s, p) \leq t] - P[R^*(s, p)\leq t]|\), where \(R^*(s, p) =\sup_{f\in F}G^* f\) and \(G^*\) is a centered Gaussian process indexed by \(F\). The proof of Theorem 3.1 relies on several technical tools: a standard covering argument, maximal and concentration inequalities, and a coupling inequality for the maxima of sums of random vectors derived in [\textit{V. Chernozhukov} et al., Ann. Stat. 42, No. 4, 1564--1597 (2014; Zbl 1317.60038)]. Proposition 3.1 establishes the approximation of the joint distributions when both the dimension \(p\) and the sparsity \(s\) are allowed to diverge with the sample size \(n\). Proposition 3.2 establishes the limiting distribution of the sum of the top \(s\) order statistics of i.i.d. chi-square random variables with one degree of freedom. Theorems 3.1 and 3.2 show that the maximum spurious correlation \(\widehat{R}_n (s, p)\) can be approximated in distribution by the multiplier bootstrap statistic \(n^{-1/2}R_n^{MB}(s, p)\). Section 4 extends the results to sparse linear models \(Y=X^T\beta^* + \varepsilon\) with a sparse regression coefficient \(\beta^*\). For a given random sample \(\{(X_i, Y_i)\}_{i=1}^n\), SCAD exploits the sparsity by \(p_{\lambda}\)-regularization, minimizing \[ \frac{1}{2n}\sum_{i=1}^n (Y_i - X_i^T\beta)^2 + \sum_{j=1}^p p_{\lambda}(|\beta_j|; a), \] where \(p_{\lambda}(\cdot ; a)\) is the SCAD penalty function defined in [Fan and Li (2001; Zbl 1073.62547)]. Theorem 4.1 states that the maximum spurious correlation \(\hat{R}_n^{\mathrm{oracle}}(1, p)\) has a limiting distribution expressed through \(Z\sim N(0, \Sigma_{22, 1})\), where \(Z\) is a \(d\)-variate centered Gaussian random vector whose covariance matrix \(\Sigma_{22, 1}\) is obtained from the partition \(\Sigma =\begin{pmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\).
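For readers unfamiliar with the penalty in the displayed objective, here is a minimal Python sketch of the SCAD penalty in its usual closed form (linear up to \(\lambda\), quadratic between \(\lambda\) and \(a\lambda\), constant beyond \(a\lambda\)) together with the penalized least-squares objective. The function names `scad_penalty` and `scad_objective` are hypothetical, and the default \(a = 3.7\) is the value commonly attributed to Fan and Li; neither is prescribed by the review. The sketch only evaluates the objective and does not implement any minimization algorithm.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(|t|; a), evaluated elementwise (assumes a > 2)."""
    t = np.abs(np.asarray(t, dtype=float))
    linear = lam * t                                              # |t| <= lam
    quadratic = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))  # lam < |t| <= a*lam
    constant = lam**2 * (a + 1) / 2                               # |t| > a*lam
    return np.where(t <= lam, linear,
                    np.where(t <= a * lam, quadratic, constant))

def scad_objective(beta, X, Y, lam, a=3.7):
    """Penalized least squares: (1/2n) * RSS + sum_j p_lambda(|beta_j|; a)."""
    resid = Y - X @ beta
    return resid @ resid / (2 * len(Y)) + scad_penalty(beta, lam, a).sum()

# Small illustrative evaluation (assumed toy data).
X_toy = np.eye(2)
Y_toy = np.array([1.0, 1.0])
print(scad_objective(np.zeros(2), X_toy, Y_toy, lam=0.5))
print(scad_penalty([0.0, 1.0, 3.7, 10.0], lam=1.0))
```

The three pieces join continuously at \(|t| = \lambda\) and \(|t| = a\lambda\), and the penalty is flat for large coefficients, which is what distinguishes SCAD from the LASSO penalty \(\lambda |t|\).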
A similar result for \(\hat{R}_n^{LLA}(1, p)\) holds in Theorem 4.2 under a restricted eigenvalue condition formulated by \textit{P. J. Bickel} et al. [Ann. Stat. 37, No. 4, 1705--1732 (2009; Zbl 1173.62022)]. Section 5 outlines three applications: (1) determining whether discoveries by machine learning and data mining techniques are any better than those reached by chance, (2) model selection using the distributions of maximum spurious correlations, and (3) validating the fundamental assumption of exogeneity in high dimensions. Section 6 presents Monte Carlo simulations on the finite-sample performance of the bootstrap approximation of the distribution of the maximum spurious correlation (MSC), covering the computation of spurious correlations, the accuracy of the multiplier bootstrap approximation, the detection of spurious discoveries, model selection, and gene expression data. Some technical lemmas and the proof of Theorem 3.1 are given in Section 7, and the remaining proofs are in the supplementary material.
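The idea behind the multiplier bootstrap approximation can be illustrated, for the simplest case \(s = 1\), by the following Python sketch. It is a simplified stand-in under stated assumptions, not the paper's exact construction of \(R_n^{MB}(s, p)\): the columns of the observed design are standardized, independent standard normal multipliers \(\xi_i\) replace the unobserved noise, and the statistic \(\max_j |n^{-1/2}\sum_i \xi_i \tilde{X}_{ij}|\) is recomputed many times. The function name `multiplier_bootstrap_s1`, the number of replications and the simulated design are all hypothetical.

```python
import numpy as np

def multiplier_bootstrap_s1(X, n_boot=500, rng=None):
    """Simplified multiplier bootstrap draws for s = 1 (hypothetical helper).

    Returns n_boot draws of max_j | n^{-1/2} sum_i xi_i * Z_ij |, where Z is
    the column-standardized design and xi_i are i.i.d. N(0, 1) multipliers.
    The draws live on the sqrt(n) scale of sqrt(n) * R_n(1, p).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized columns
    xi = rng.standard_normal((n_boot, n))      # Gaussian multipliers
    return np.abs(xi @ Z).max(axis=1) / np.sqrt(n)

# Illustrative use (assumed design): a 95% threshold for the largest
# absolute sample correlation between pure noise and p = 200 covariates.
rng = np.random.default_rng(1)
n, p = 100, 200
X = rng.standard_normal((n, p))
draws = multiplier_bootstrap_s1(X, n_boot=500, rng=rng)
crit = np.quantile(draws, 0.95) / np.sqrt(n)   # back on the correlation scale
print(crit)
```

In this sketch, an observed maximum absolute correlation above `crit` would be flagged as larger than what chance alone typically produces at this \(n\) and \(p\); the paper's procedure handles general \(s\) and comes with the consistency guarantees summarized above.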

zbMATH Keywords

high dimension

spurious correlation

multiplier bootstrap

false discovery

Lasso
