A partially linear framework for massive heterogeneous data

DOI: 10.1214/15-AOS1410; zbMATH Open: 1358.62050; arXiv: 1410.8570; Wikidata: Q36352871; Scholia: Q36352871; MaRDI QID: Q309709; FDO: Q309709

Authors: Tianqi Zhao, Guang Cheng, Han Liu

Publication date: 7 September 2016

Published in: The Annals of Statistics

Abstract: We consider a partially linear framework for modelling massive heterogeneous data. The major goal is to extract common features across all sub-populations while exploring heterogeneity of each sub-population. In particular, we propose an aggregation type estimator for the commonality parameter that possesses the (non-asymptotic) minimax optimal bound and asymptotic distribution as if there were no heterogeneity. This oracular result holds when the number of sub-populations does not grow too fast. A plug-in estimator for the heterogeneity parameter is further constructed, and shown to possess the asymptotic distribution as if the commonality information were available. We also test the heterogeneity among a large number of sub-populations. All the above results require regularizing each sub-estimation as though it had the entire sample size. Our general theory applies to the divide-and-conquer approach that is often used to deal with massive homogeneous data. A technical by-product of this paper is statistical inference for general kernel ridge regression. Thorough numerical results are also provided to back up our theory.
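Illustration (not part of the original record): the aggregation and plug-in steps described in the abstract can be sketched on simulated data. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes each sub-population j follows y = x'beta_j + f(z) + noise, with the shared function f as the commonality and beta_j as the heterogeneity. The Gaussian kernel, its bandwidth, the rate used for `lam`, the simulated data, and all function names are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(a, b, bandwidth=0.2):
    """Gaussian kernel matrix between 1-D point sets a and b (assumed kernel)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def fit_partially_linear_krr(x, z, y, lam):
    """Fit y = x @ beta + f(z) + noise on one sub-population.

    Minimizes squared error plus n * lam * (RKHS norm of f)^2.
    Returns (beta_hat, alpha), where f_hat(.) = K(., z) @ alpha.
    """
    n = len(y)
    m = np.linalg.inv(gaussian_kernel(z, z) + n * lam * np.eye(n))
    w = n * lam * m  # equals I - S, with S the kernel ridge smoother matrix
    beta = np.linalg.solve(x.T @ w @ x, x.T @ w @ y)  # profiled-out beta
    alpha = m @ (y - x @ beta)
    return beta, alpha

# Simulated setting (all values are illustrative assumptions).
s, n = 5, 200                 # number of sub-populations, size of each
N = s * n                     # entire sample size
lam = N ** -0.8               # penalty tied to N rather than n; rate is arbitrary
f0 = lambda z: np.sin(2 * np.pi * z)   # shared (commonality) function
betas_true = rng.normal(size=(s, 2))   # sub-population-specific coefficients

data, fits = [], []
for j in range(s):
    x = rng.normal(size=(n, 2))
    z = rng.uniform(size=n)
    y = x @ betas_true[j] + f0(z) + 0.5 * rng.normal(size=n)
    data.append((x, z, y))
    fits.append(fit_partially_linear_krr(x, z, y, lam))

def f_bar(z_new):
    """Aggregation step: average the s sub-population estimates of f."""
    preds = [gaussian_kernel(z_new, z) @ alpha
             for (_, z, _), (_, alpha) in zip(data, fits)]
    return np.mean(preds, axis=0)

# Plug-in step (assumed form): least squares of y - f_bar(z) on x.
betas_plugin = np.array([
    np.linalg.lstsq(x, y - f_bar(z), rcond=None)[0] for x, z, y in data
])

grid = np.linspace(0.05, 0.95, 50)
f_error = np.mean(np.abs(f_bar(grid) - f0(grid)))
beta_error = np.max(np.abs(betas_plugin - betas_true))
print(f"mean |f_bar - f0| on grid: {f_error:.3f}")
print(f"max |beta_plugin - beta_true|: {beta_error:.3f}")
```

Setting all `betas_true` rows equal reduces this to the divide-and-conquer setting for homogeneous data that the abstract mentions.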

Full work available at URL: https://arxiv.org/abs/1410.8570

zbMATH Keywords

heterogeneous data; kernel ridge regression; partially linear model; divide-and-conquer method; massive data

Mathematics Subject Classification ID

Point estimation (62F10)
Asymptotic properties of parametric estimators (62F12)
Asymptotic properties of nonparametric inference (62G20)
Parametric tolerance and confidence regions (62F25)
Ridge regression; shrinkage estimators (Lasso) (62J07)

Cited In (73)
