Are discoveries spurious? Distributions of maximum spurious correlations and their applications (Q1650067): Difference between revisions

The paper under review studies whether discoveries made by data mining approaches may be spurious owing to high dimensionality and limited sample size. It uses the multiplier bootstrap procedure to approximate the unknown distributions of maximum spurious correlations and establishes the consistency of this approach. Various statistical and machine learning methods and algorithms have been proposed to find a small group of covariates associated with responses such as biological and clinical outcomes. \textit{R. Tibshirani} [J. R. Stat. Soc., Ser. B 58, No. 1, 267--288 (1996; Zbl 0850.62538)] and \textit{J. Fan} and \textit{R. Li} [J. Am. Stat. Assoc. 96, No. 456, 1348--1360 (2001; Zbl 1073.62547)] introduced the LASSO and SCAD, respectively, based on an exogeneity assumption (all covariates are uncorrelated with the residual of the true model). \textit{J. Fan} and \textit{Y. Liao} [Ann. Stat. 42, No. 3, 872--917 (2014; Zbl 1305.62113)] and \textit{J. Fan} et al. [Ann. Stat. 42, No. 3, 819--849 (2014; Zbl 1305.62252)] gave evidence that such an ideal assumption might not hold, although it is necessary for model selection consistency. Whether data mining techniques produce results that are better than spurious correlation depends not only on the correlation between the fitted and observed values, but also on the sample size, the number of variables selected and the total number of variables. Let \(X\) be a \(p\)-dimensional random vector of covariates and \(X_S\) be the subset of covariates indexed by \(S\). Let \(\widehat{\mathrm{corr}}_n (\varepsilon, \alpha_S^T X_S)\) be the sample correlation between the random noise \(\varepsilon\) (independent of \(X\)) and \(\alpha_S^T X_S\) based on a sample of size \(n\), where \(\alpha_S\) is a constant vector. 
The maximum spurious correlation is defined as \[ \widehat{R}_n (s, p) = \max_{|S|=s}\max_{\alpha_S} \widehat{\mathrm{corr}}_n (\varepsilon, \alpha_S^T X_S), \] when \(X\) and \(\varepsilon\) are independent, where the maximum is taken over all \({p\choose s}\) subsets of size \(s\) and all linear combinations of the selected \(s\) covariates. For example, the null hypothesis \(E(\varepsilon X_j)=0\), \(j=1, \dots, p\), can be tested by comparing the observed maximum correlation with the distribution of \(\widehat{R}_n (1, p)\). \textit{J. Fan} et al. [J. R. Stat. Soc., Ser. B, Stat. Methodol. 74, No. 1, 37--65 (2012)] conducted simulations to demonstrate that the spurious correlation can be very high when \(p\) is large and that it grows fast with \(s\). There are several challenges in deriving the asymptotic distribution of the statistic \(\widehat{R}_n (s, p)\). The paper under review uses the multiplier bootstrap method and demonstrates its consistency under mild conditions, where the theoretical validity is guaranteed by the multiplier central limit theorem of \textit{A. van der Vaart} and \textit{J. A. Wellner} [Weak convergence and empirical processes. With applications to statistics. New York, NY: Springer (1996; Zbl 0862.60002)]. Section 2 introduces the concept of spurious correlation and the moment conditions used for the asymptotic distribution analysis of \(\widehat{R}_n (s, p)\). The sampling assumption is that \(\{\varepsilon_i\}_{i=1}^n\) and \(\{X_i\}_{i=1}^n\) are independent random samples from the distributions of \(\varepsilon\) and \(X\). The \(s\)-sparse minimal and maximal eigenvalues of the covariance matrix \(\Sigma\) are denoted by \(\phi_{\mathrm{min}}(s)\) and \(\phi_{\mathrm{max}}(s)\), respectively. The \(s\)-sparse condition number of \(\Sigma\) is defined to be \(\gamma_s = \sqrt{\frac{\phi_{\mathrm{max}}(s)}{\phi_{\mathrm{min}}(s)}}\). 
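The definition of \(\widehat{R}_n (s, p)\) can be illustrated numerically. The following is a minimal sketch, not the procedure of the paper: it assumes independent standard Gaussian covariates and noise, and the function name `max_spurious_corr` is hypothetical. It uses the fact that, for a fixed subset \(S\), the maximum over \(\alpha_S\) of the sample correlation equals the sample multiple correlation of \(\varepsilon\) on \(X_S\). The exhaustive search over all \({p\choose s}\) subsets is only feasible for small \(p\) and \(s\).

```python
import itertools
import numpy as np

def max_spurious_corr(X, eps, s=1):
    """Largest sample multiple correlation between eps and any s columns of X.

    Illustrative brute-force helper (hypothetical name), not the paper's algorithm.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)      # center covariates
    ec = eps - eps.mean()        # center noise
    e_norm = np.linalg.norm(ec)
    if s == 1:
        # With one covariate, alpha only chooses the sign, so the maximum
        # over alpha is the absolute sample correlation.
        num = np.abs(Xc.T @ ec)
        den = np.linalg.norm(Xc, axis=0) * e_norm
        return float(np.max(num / den))
    best = 0.0
    for S in itertools.combinations(range(p), s):
        XS = Xc[:, list(S)]
        # Least-squares projection of eps onto span(X_S); the correlation
        # between eps and its projection is ||fitted|| / ||eps||.
        coef, *_ = np.linalg.lstsq(XS, ec, rcond=None)
        fitted = XS @ coef
        best = max(best, float(np.linalg.norm(fitted) / e_norm))
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)           # assumed seed, for reproducibility
    n = 50
    eps = rng.standard_normal(n)
    for p in (10, 100, 1000):
        X = rng.standard_normal((n, p))      # X is independent of eps by construction
        print(p, max_spurious_corr(X, eps, s=1))
```

Because the maximum is taken over a larger collection of covariates as \(p\) grows, the printed values for nested designs cannot decrease with \(p\), even though \(\varepsilon\) is independent of every covariate.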
Section 3 first states Theorem 3.1, a Berry-Esseen bound depending explicitly on the triple \((s, p, n)\), which, under the assumptions of Section 2, bounds \(\sup_{t\geq 0} |P[\sqrt{n}\widehat{R}_n (s, p) \leq t] - P[R^*(s, p)\leq t]|\) for \(R^*(s, p) =\sup_{f\in F}G^* f\), where \(G^*\) is a centered Gaussian process indexed by \(F\). The proof of Theorem 3.1 relies on several technical tools, including a standard covering argument, maximal and concentration inequalities, and a coupling inequality for the maxima of sums of random vectors derived in [\textit{V. Chernozhukov} et al., Ann. Stat. 42, No. 4, 1564--1597 (2014; Zbl 1317.60038)]. Proposition 3.1 establishes the approximation of the joint distributions when both the dimension \(p\) and the sparsity \(s\) are allowed to diverge with the sample size \(n\). Proposition 3.2 establishes the limiting distribution of the sum of the top \(s\) order statistics of i.i.d. chi-square random variables with one degree of freedom. Theorem 3.1 and Theorem 3.2 show that the maximum spurious correlation \(\widehat{R}_n (s, p)\) can be approximated in distribution by the multiplier bootstrap statistic \(n^{-1/2}R_n^{MB}(s, p)\). Section 4 extends the results to sparse linear models \(Y=X^T\beta^* + \varepsilon\) with a sparse regression coefficient vector \(\beta^*\). For a given random sample \(\{(X_i, Y_i)\}_{i=1}^n\), the SCAD exploits the sparsity by \(p_{\lambda}\)-regularization, which minimizes \[ \frac{1}{2n}\sum_{i=1}^n (Y_i - X_i^T\beta)^2 + \sum_{j=1}^p p_{\lambda}(|\beta_j|; a), \] where \(p_{\lambda}(\cdot ; a)\) is the SCAD penalty function defined in [Fan and Li (2001; Zbl 1073.62547)]. Theorem 4.1 states that the maximum spurious correlation \(\hat{R}_n^{\mathrm{oracle}}(1, p)\) has the limiting distribution of \(Z\sim N(0, \Sigma_{22, 1})\), where \(Z\) is a \(d\)-variate centered Gaussian random vector with covariance matrix \(\Sigma_{22, 1}\) in \(\Sigma =\begin{pmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\). 
A similar result for \(\hat{R}_n^{LLA}(1, p)\) also holds in Theorem 4.2 under the restricted eigenvalue condition formulated by \textit{P. J. Bickel} et al. [Ann. Stat. 37, No. 4, 1705--1732 (2009; Zbl 1173.62022)]. Section 5 outlines three applications: (1) determining whether discoveries made by machine learning and data mining techniques are any better than those reached by chance, (2) model selection using the distributions of maximum spurious correlations, and (3) validating the fundamental assumption of exogeneity in high dimensions. Section 6 presents Monte Carlo simulations on the finite-sample performance of the bootstrap approximation of the distribution of the maximum spurious correlation (MSC), covering the computation of spurious correlations, the accuracy of the multiplier bootstrap approximation, the detection of spurious discoveries, model selection, and gene expression data. Some technical lemmas and the proof of Theorem 3.1 are given in Section 7, and the rest are in the supplementary material.
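The penalized least-squares objective of Section 4 can be made concrete with a short sketch. It assumes the standard form of the SCAD penalty from Fan and Li (2001), namely \(\lambda t\) for \(t\le\lambda\), \((2a\lambda t - t^2 - \lambda^2)/(2(a-1))\) for \(\lambda < t\le a\lambda\), and \(\lambda^2(a+1)/2\) for \(t> a\lambda\), with \(a>2\). The function names `scad_penalty` and `scad_objective` are hypothetical, and the default \(a=3.7\) is the value commonly suggested for SCAD rather than something stated in the review.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """Elementwise SCAD penalty p_lambda(|beta_j|; a), assuming a > 2 and lam > 0."""
    t = np.abs(np.asarray(beta, dtype=float))
    linear = lam * t                                            # region t <= lam
    quad = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))    # region lam < t <= a*lam
    const = lam**2 * (a + 1) / 2                                # region t > a*lam
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, const))

def scad_objective(beta, X, y, lam, a=3.7):
    """Value of (1/(2n)) * sum_i (y_i - x_i^T beta)^2 + sum_j p_lambda(|beta_j|; a)."""
    beta = np.asarray(beta, dtype=float)
    n = X.shape[0]
    resid = y - X @ beta
    return float(resid @ resid / (2 * n) + scad_penalty(beta, lam, a).sum())
```

The sketch only evaluates the objective. It does not minimize it; a local linear approximation (LLA) scheme or any other solver would have to be supplied separately.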

0 references

Mathematics Subject Classification ID

62H10

0 references

high dimension

0 references

spurious correlation

0 references

multiplier bootstrap

0 references

false discovery

0 references

Lasso

0 references

SCAD

0 references

exogeneity

0 references

data mining

0 references

machine learning

0 references

asymptotic distribution

0 references

consistency

0 references

sub-Gaussian

0 references

sparse linear model

0 references

covariate

0 references

Wikidata QID

Q55399837

0 references

describes a project that uses

ElemStatLearn

0 references

MaRDI profile type

MaRDI publication profile

0 references

arXiv ID

1502.04237

0 references

cites work

Some nonasymptotic results on resampling in high dimension. I: Confidence regions

0 references

Consistent Tests for Stochastic Dominance

0 references

Simultaneous analysis of Lasso and Dantzig selector

0 references

Q5706832

0 references

Statistics for high-dimensional data. Methods, theory and applications.

0 references

Distributions of Angles in Random Packing on Spheres

0 references

Limiting laws of coherence of random matrices with applications to testing covariance structure and construction of compressed sensing matrices

0 references

Two-Sample Test of High Dimensional Means Under Dependence

0 references

Simulation‐based hypothesis testing of high dimensional means under covariance heterogeneity

0 references

Generalized bootstrap for estimating equations

0 references

Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors

0 references

Gaussian approximation of suprema of empirical processes

0 references

Multiple testing procedures with applications to genomics.

0 references

Q3584739

0 references

Variance Estimation Using Refitted Cross-Validation in Ultrahigh Dimensional Regression

0 references

Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties

0 references

Endogeneity in high dimensions

0 references

A Selective Overview of Variable Selection in High Dimensional Feature Space (Invited Review Article)

0 references

Strong oracle optimality of folded concave penalized estimation

0 references

Testing Against a High Dimensional Alternative

0 references

Inference When a Nuisance Parameter Is Not Identified Under the Null Hypothesis

0 references

Necessary and sufficient conditions for the asymptotic distributions of coherence of ultra-high dimensional random matrices

0 references

Q4864293

0 references

Weak convergence and empirical processes. With applications to statistics

0 references

Nearly unbiased variable selection under minimax concave penalty

0 references

One-step sparse estimates in nonconcave penalized likelihood models

0 references

OpenAlex ID

W178948881

0 references

Sitelinks

Mathematics(1 entry)

mardi Publication:1650067

@@ Property / MaRDI profile type @@
+MaRDI publication profile
@@ Property / MaRDI profile type: MaRDI publication profile / rank @@
+Normal rank
@@ Property / arXiv ID @@
+1502.04237
@@ Property / arXiv ID: 1502.04237 / rank @@
+Normal rank
@@ Property / cites work @@
+Some nonasymptotic results on resampling in high dimension. I: Confidence regions
+Normal rank
@@ Property / cites work @@
+Consistent Tests for Stochastic Dominance
@@ Property / cites work: Consistent Tests for Stochastic Dominance / rank @@
+Normal rank
@@ Property / cites work @@
+Simultaneous analysis of Lasso and Dantzig selector
+Normal rank
@@ Property / cites work @@
+Q5706832
@@ Property / cites work: Q5706832 / rank @@
+Normal rank
@@ Property / cites work @@
+Statistics for high-dimensional data. Methods, theory and applications.
+Normal rank
@@ Property / cites work @@
+Distributions of Angles in Random Packing on Spheres
+Normal rank
@@ Property / cites work @@
+Limiting laws of coherence of random matrices with applications to testing covariance structure and construction of compressed sensing matrices
+Normal rank
@@ Property / cites work @@
+Two-Sample Test of High Dimensional Means Under Dependence
+Normal rank
@@ Property / cites work @@
+Simulation‐based hypothesis testing of high dimensional means under covariance heterogeneity
+Normal rank
@@ Property / cites work @@
+Generalized bootstrap for estimating equations
@@ Property / cites work: Generalized bootstrap for estimating equations / rank @@
+Normal rank
@@ Property / cites work @@
+Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors
+Normal rank
@@ Property / cites work @@
+Gaussian approximation of suprema of empirical processes
+Normal rank
@@ Property / cites work @@
+Multiple testing procedures with applications to genomics.
+Normal rank
@@ Property / cites work @@
+Q3584739
@@ Property / cites work: Q3584739 / rank @@
+Normal rank
@@ Property / cites work @@
+Variance Estimation Using Refitted Cross-Validation in Ultrahigh Dimensional Regression
+Normal rank
@@ Property / cites work @@
+Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties
+Normal rank
@@ Property / cites work @@
+Endogeneity in high dimensions
@@ Property / cites work: Endogeneity in high dimensions / rank @@
+Normal rank
@@ Property / cites work @@
+A Selective Overview of Variable Selection in High Dimensional Feature Space (Invited Review Article)
+Normal rank
@@ Property / cites work @@
+Strong oracle optimality of folded concave penalized estimation
+Normal rank
@@ Property / cites work @@
+Testing Against a High Dimensional Alternative
@@ Property / cites work: Testing Against a High Dimensional Alternative / rank @@
+Normal rank
@@ Property / cites work @@
+Inference When a Nuisance Parameter Is Not Identified Under the Null Hypothesis
+Normal rank
@@ Property / cites work @@
+Necessary and sufficient conditions for the asymptotic distributions of coherence of ultra-high dimensional random matrices
+Normal rank
@@ Property / cites work @@
+Q4864293
@@ Property / cites work: Q4864293 / rank @@
+Normal rank
@@ Property / cites work @@
+Weak convergence and empirical processes. With applications to statistics
+Normal rank
@@ Property / cites work @@
+Nearly unbiased variable selection under minimax concave penalty
+Normal rank
@@ Property / cites work @@
+One-step sparse estimates in nonconcave penalized likelihood models
+Normal rank
@@ Property / OpenAlex ID @@
+W178948881
@@ Property / OpenAlex ID: W178948881 / rank @@
+Normal rank