Are discoveries spurious? Distributions of maximum spurious correlations and their applications (Q1650067): Difference between revisions

From MaRDI portal
Property / MaRDI profile type: MaRDI publication profile
Property / arXiv ID: 1502.04237
Property / cites work: Some nonasymptotic results on resampling in high dimension. I: Confidence regions
Property / cites work: Consistent Tests for Stochastic Dominance
Property / cites work: Simultaneous analysis of Lasso and Dantzig selector
Property / cites work: Q5706832
Property / cites work: Statistics for high-dimensional data. Methods, theory and applications.
Property / cites work: Distributions of Angles in Random Packing on Spheres
Property / cites work: Limiting laws of coherence of random matrices with applications to testing covariance structure and construction of compressed sensing matrices
Property / cites work: Two-Sample Test of High Dimensional Means Under Dependence
Property / cites work: Simulation‐based hypothesis testing of high dimensional means under covariance heterogeneity
Property / cites work: Generalized bootstrap for estimating equations
Property / cites work: Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors
Property / cites work: Gaussian approximation of suprema of empirical processes
Property / cites work: Multiple testing procedures with applications to genomics.
Property / cites work: Q3584739
Property / cites work: Variance Estimation Using Refitted Cross-Validation in Ultrahigh Dimensional Regression
Property / cites work: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties
Property / cites work: Endogeneity in high dimensions
Property / cites work: A Selective Overview of Variable Selection in High Dimensional Feature Space (Invited Review Article)
Property / cites work: Strong oracle optimality of folded concave penalized estimation
Property / cites work: Testing Against a High Dimensional Alternative
Property / cites work: Inference When a Nuisance Parameter Is Not Identified Under the Null Hypothesis
Property / cites work: Necessary and sufficient conditions for the asymptotic distributions of coherence of ultra-high dimensional random matrices
Property / cites work: Q4864293
Property / cites work: Weak convergence and empirical processes. With applications to statistics
Property / cites work: Nearly unbiased variable selection under minimax concave penalty
Property / cites work: One-step sparse estimates in nonconcave penalized likelihood models
Property / OpenAlex ID: W178948881

Latest revision as of 11:00, 30 July 2024

Language: English
Label: Are discoveries spurious? Distributions of maximum spurious correlations and their applications
Description: scientific article

    Statements

    Are discoveries spurious? Distributions of maximum spurious correlations and their applications (English)
    29 June 2018
    The paper under review studies whether discoveries made by data mining approaches may be spurious owing to high dimensionality and limited sample size. It uses the multiplier bootstrap procedure to approximate the unknown distributions and establishes the consistency of this approach. Various statistical and machine learning methods and algorithms have been proposed to find a small group of covariates associated with responses such as biological and clinical outcomes. \textit{R. Tibshirani} [J. R. Stat. Soc., Ser. B 58, No. 1, 267--288 (1996; Zbl 0850.62538)] and \textit{J. Fan} and \textit{R. Li} [J. Am. Stat. Assoc. 96, No. 456, 1348--1360 (2001; Zbl 1073.62547)] introduced the LASSO and SCAD, respectively, based on an exogeneity assumption (all covariates are uncorrelated with the residual of the true model). \textit{J. Fan} and \textit{Y. Liao} [Ann. Stat. 42, No. 3, 872--917 (2014; Zbl 1305.62113)] and \textit{J. Fan} et al. [Ann. Stat. 42, No. 3, 819--849 (2014; Zbl 1305.62252)] gave evidence that such an ideal assumption might not hold, although it is necessary for model selection consistency. Whether data mining techniques produce results that are better than spurious correlation depends not only on the correlation between the fitted and observed values, but also on the sample size, the number of variables selected and the total number of variables. Let \(X\) be a \(p\)-dimensional random vector of covariates and let \(X_S\) be the subset of covariates indexed by \(S\). Let \(\widehat{\mathrm{corr}}_n (\varepsilon, \alpha_S^T X_S)\) be the sample correlation between the random noise \(\varepsilon\) (independent of \(X\)) and \(\alpha_S^T X_S\) based on a sample of size \(n\), where \(\alpha_S\) is a constant vector. 
The maximum spurious correlation is defined as \[ \widehat{R}_n (s, p) = \max_{|S|=s}\max_{\alpha_S} \widehat{\mathrm{corr}}_n (\varepsilon, \alpha_S^T X_S), \] when \(X\) and \(\varepsilon\) are independent, where the maximum is taken over all \({p\choose s}\) subsets of size \(s\) and over all linear combinations of the selected \(s\) covariates. For example, testing the null hypothesis \(E(\varepsilon X_j)=0\), \(j=1, \dots, p\), amounts to comparing the maximum sample correlation with the distribution of \(\widehat{R}_n (1, p)\). \textit{J. Fan} et al. [J. R. Stat. Soc., Ser. B, Stat. Methodol. 74, No. 1, 37--65 (2012)] conducted simulations demonstrating that the spurious correlation can be very high when \(p\) is large and that it grows fast with \(s\). Deriving the asymptotic distribution of the statistic \(\widehat{R}_n (s, p)\) poses several challenges. The paper under review uses the multiplier bootstrap method and demonstrates its consistency under mild conditions, where the theoretical validity is guaranteed by the multiplier central limit theorem of \textit{A. van der Vaart} and \textit{J. A. Wellner} [Weak convergence and empirical processes. With applications to statistics. New York, NY: Springer (1996; Zbl 0862.60002)]. Section 2 introduces the concept of spurious correlation and the moment conditions needed for the asymptotic analysis of \(\widehat{R}_n (s, p)\). The sampling assumption is that \(\{\varepsilon_i\}_{i=1}^n\) and \(\{X_i\}_{i=1}^n\) are independent random samples from the distributions of \(\varepsilon\) and \(X\). The \(s\)-sparse minimal and maximal eigenvalues of the covariance matrix \(\Sigma\) are denoted by \(\phi_{\mathrm{min}}(s)\) and \(\phi_{\mathrm{max}}(s)\), respectively. The \(s\)-sparse condition number of \(\Sigma\) is defined to be \(\gamma_s = \sqrt{\frac{\phi_{\mathrm{max}}(s)}{\phi_{\mathrm{min}}(s)}}\). 
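The definition of \(\widehat{R}_n (s, p)\) can be illustrated with a small brute-force sketch in Python. It relies on the standard fact that, for a fixed subset \(S\), the maximum over \(\alpha_S\) of the sample correlation equals the multiple correlation coefficient from regressing \(\varepsilon\) on \(X_S\). The function name `max_spurious_correlation` and the exhaustive search over subsets are assumptions made for illustration only; they are feasible only for small \(p\) and \(s\) and are not the computational method of the paper.

```python
import itertools
import numpy as np

def max_spurious_correlation(X, eps, s):
    """Brute-force R_hat_n(s, p): hypothetical helper, small p and s only."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)          # center the covariates
    ec = eps - eps.mean()            # center the noise
    total = ec @ ec                  # total sum of squares of the noise
    best = 0.0
    for S in itertools.combinations(range(p), s):
        XS = Xc[:, list(S)]
        # Least-squares fit of the noise on X_S; the fitted values give the
        # linear combination alpha_S^T X_S with the largest correlation.
        coef, *_ = np.linalg.lstsq(XS, ec, rcond=None)
        fitted = XS @ coef
        r = np.sqrt((fitted @ fitted) / total)   # multiple correlation
        best = max(best, r)
    return best

# Illustrative data: noise generated independently of the covariates.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
eps = rng.standard_normal(50)
r1 = max_spurious_correlation(X, eps, 1)
r2 = max_spurious_correlation(X, eps, 2)
```

For \(s=1\) the result coincides with the largest absolute pairwise sample correlation, and the value cannot decrease as \(s\) grows, which is consistent with the simulation findings cited above.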
Section 3 first states Theorem 3.1, a Berry-Esseen-type bound that depends explicitly on the triple \((s, p, n)\). Under the assumptions of Section 2, it bounds \(\sup_{t\geq 0} |P[\sqrt{n}\widehat{R}_n (s, p) \leq t] - P[R^*(s, p)\leq t]|\), where \(R^*(s, p) =\sup_{f\in F}G^* f\) is the supremum of a centered Gaussian process \(G^*\) indexed by \(F\). The proof of Theorem 3.1 relies on several technical tools, including a standard covering argument, maximal and concentration inequalities, and a coupling inequality for the maxima of sums of random vectors derived in [\textit{V. Chernozhukov} et al., Ann. Stat. 42, No. 4, 1564--1597 (2014; Zbl 1317.60038)]. Proposition 3.1 establishes the approximation of the joint distributions when both the dimension \(p\) and the sparsity \(s\) are allowed to diverge with the sample size \(n\). Proposition 3.2 establishes the limiting distribution of the sum of the top \(s\) order statistics of i.i.d. chi-square random variables with one degree of freedom. Theorem 3.1 and Theorem 3.2 show that the maximum spurious correlation \(\widehat{R}_n (s, p)\) can be approximated in distribution by the multiplier bootstrap statistic \(n^{-1/2}R_n^{MB}(s, p)\). Section 4 extends the results to sparse linear models \(Y=X^T\beta^* + \varepsilon\) with a sparse regression coefficient \(\beta^*\). For a given random sample \(\{(X_i, Y_i)\}_{i=1}^n\), the SCAD exploits the sparsity by \(p_{\lambda}\)-regularization, which minimizes \[ \frac{1}{2n}\sum_{i=1}^n (Y_i - X_i^T\beta)^2 + \sum_{j=1}^p p_{\lambda}(|\beta_j|; a), \] where \(p_{\lambda}(\cdot ; a)\) is the SCAD penalty function defined in [Fan and Li (2001; Zbl 1073.62547)]. Theorem 4.1 states that the maximum spurious correlation \(\hat{R}_n^{\mathrm{oracle}}(1, p)\) has the limiting distribution of \(Z\sim N(0, \Sigma_{22, 1})\), where \(Z\) is a \(d\)-variate centered Gaussian random vector with covariance matrix \(\Sigma_{22, 1}\) obtained from the block partition \(\Sigma =\begin{pmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\). 
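The SCAD-penalized objective of Section 4 can be written out as a short Python sketch. The piecewise form of \(p_{\lambda}(\cdot ; a)\) below is the standard one from Fan and Li (2001), and \(a = 3.7\) is the default value suggested there; the function names `scad_penalty` and `scad_objective` are hypothetical, and the sketch only evaluates the objective without minimizing it.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(|t|; a), evaluated elementwise (requires a > 2)."""
    t = np.abs(np.asarray(t, dtype=float))
    linear = lam * t                                             # |t| <= lam
    quadratic = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))  # lam < |t| <= a*lam
    constant = lam**2 * (a + 1) / 2                              # |t| > a*lam
    return np.where(t <= lam, linear,
                    np.where(t <= a * lam, quadratic, constant))

def scad_objective(beta, X, Y, lam, a=3.7):
    """(1/2n) * sum of squared residuals plus the summed SCAD penalty."""
    n = X.shape[0]
    resid = Y - X @ beta
    return resid @ resid / (2 * n) + scad_penalty(beta, lam, a).sum()
```

The penalty grows linearly near zero, like the Lasso, and is constant beyond \(a\lambda\), so large coefficients are not shrunk further.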
A similar result for \(\hat{R}_n^{LLA}(1, p)\) holds in Theorem 4.2 under the restricted eigenvalue condition formulated by \textit{P. J. Bickel} et al. [Ann. Stat. 37, No. 4, 1705--1732 (2009; Zbl 1173.62022)]. Section 5 outlines three applications: (1) determining whether discoveries made by machine learning and data mining techniques are any better than those reached by chance, (2) model selection using the distributions of maximum spurious correlations, and (3) validating the fundamental assumption of exogeneity in high dimensions. Section 6 presents Monte Carlo simulations of the finite-sample performance of the bootstrap approximation to the distribution of the maximum spurious correlation (MSC), covering the computational intensity of the spurious correlation, the accuracy of the multiplier bootstrap approximation, the detection of spurious discoveries, model selection, and gene expression data. Some technical lemmas and the proof of Theorem 3.1 are given in Section 7, and the remaining proofs are in the supplementary material.
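The idea behind the bootstrap approximation can be sketched in Python for the simplest case \(s = 1\). This is a simplified illustration in the spirit of the multiplier bootstrap for maxima of sums, not the paper's exact statistic \(R_n^{MB}(s, p)\); the function name `multiplier_bootstrap_quantile`, the Gaussian multipliers, and the default arguments are all assumptions.

```python
import numpy as np

def multiplier_bootstrap_quantile(X, eps, level=0.95, n_boot=1000, seed=0):
    """Bootstrap quantile for sqrt(n) * max_j |corr_n(eps, X_j)| (s = 1 only)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize covariates
    es = (eps - eps.mean()) / eps.std()           # standardize noise
    Z = Xs * es[:, None]                          # column means are sample correlations
    Z = Z - Z.mean(axis=0)                        # center before perturbing
    xi = rng.standard_normal((n_boot, n))         # i.i.d. N(0, 1) multipliers
    # Each row of xi gives one bootstrap draw of max_j |n^{-1/2} sum_i xi_i Z_ij|.
    stats = np.abs(xi @ Z).max(axis=1) / np.sqrt(n)
    return np.quantile(stats, level)
```

Under these assumptions, an observed value of \(\sqrt{n}\) times the largest absolute sample correlation that exceeds the returned quantile would suggest that the discovery is not purely spurious at the chosen level.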
    high dimension
    spurious correlation
    multiplier bootstrap
    false discovery
    Lasso
    SCAD
    exogeneity
    data mining
    machine learning
    asymptotic distribution
    consistency
    sub-Gaussian
    sparse linear model
    covariate