Multiple Imputation Through XGBoost

From MaRDI portal
Publication:75448

DOI: 10.48550/ARXIV.2106.01574
arXiv: 2106.01574
MaRDI QID: Q75448
FDO: Q75448

Thomas Lumley, Yongshi Deng

Publication date: 3 June 2021

Abstract: The use of multiple imputation (MI) is becoming increasingly popular for addressing missing data. Although some conventional MI approaches have been well studied and have shown empirical validity, they have limitations when processing large datasets with complex data structures. Their imputation performance usually relies on proper specification of the imputation models, which requires expert knowledge of the inherent relations among variables. Moreover, these standard approaches tend to be computationally inefficient for medium and large datasets. In this paper, we propose mixgb, a scalable MI framework based on XGBoost, subsampling, and predictive mean matching. Our approach leverages the power of XGBoost, a fast implementation of gradient boosted trees, to automatically capture interactions and non-linear relations while achieving high computational efficiency. In addition, we incorporate subsampling and predictive mean matching to reduce bias and to better account for imputation variability. The proposed framework is implemented in the R package mixgb. Supplementary materials for this article are available online.
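To make the combination of subsampling and predictive mean matching concrete, here is a minimal Python sketch of the general idea. It is not the mixgb implementation and does not use the mixgb API. A least-squares line stands in for the XGBoost model so that the example needs only the standard library, and the function names (fit_line, pmm_draw, multiply_impute), the donor count k, and the subsample fraction are all hypothetical choices made for illustration. The sketch also assumes that "subsampling" means fitting each imputation model on a random subset of the observed rows; the abstract does not spell out the details.

```python
import random


def fit_line(xs, ys):
    """Least-squares line; a simple stand-in for the XGBoost model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def pmm_draw(pred_missing, donor_preds, donor_values, k, rng):
    """Predictive mean matching: pick one of the k donors whose
    predictions are closest to pred_missing, return its observed value."""
    order = sorted(range(len(donor_preds)),
                   key=lambda i: abs(donor_preds[i] - pred_missing))
    return donor_values[rng.choice(order[:k])]


def multiply_impute(xs, ys, m=5, subsample=0.7, k=3, seed=0):
    """Return m completed copies of ys (None marks a missing value)."""
    rng = random.Random(seed)
    observed = [i for i, y in enumerate(ys) if y is not None]
    missing = [i for i, y in enumerate(ys) if y is None]
    completed = []
    for _ in range(m):
        # Fit on a random subset of observed rows so that the m
        # imputation models differ from one another.
        size = min(len(observed), max(2, int(subsample * len(observed))))
        rows = rng.sample(observed, size)
        slope, intercept = fit_line([xs[i] for i in rows],
                                    [ys[i] for i in rows])
        donor_preds = [slope * xs[i] + intercept for i in observed]
        donor_values = [ys[i] for i in observed]
        filled = list(ys)
        for i in missing:
            pred = slope * xs[i] + intercept
            filled[i] = pmm_draw(pred, donor_preds, donor_values, k, rng)
        completed.append(filled)
    return completed


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.1, None, 2.9, 4.2, None, 6.1, 6.8, 8.3]
imputed_sets = multiply_impute(xs, ys)
```

Because each missing entry is filled with a value that was actually observed, the imputations stay within the range of the observed data, and the spread across the m completed datasets reflects both the model-to-model variation from subsampling and the random donor choice.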












