Providing access to confidential research data through synthesis and verification: an application to data on employees of the U.S. federal government
From MaRDI portal
Publication:1624838
Abstract: Data stewards seeking to provide access to large-scale social science data face a difficult challenge. They have to share data in ways that protect privacy and confidentiality, are informative for many analyses and purposes, and are relatively straightforward to use by data analysts. One approach suggested in the literature is that data stewards generate and release synthetic data, i.e., data simulated from statistical models, while also providing users access to a verification server that allows them to assess the quality of inferences from the synthetic data. We present an application of the synthetic data plus verification server approach to longitudinal data on employees of the U.S. federal government. As part of the application, we present a novel model for generating synthetic career trajectories, as well as strategies for generating high dimensional, longitudinal synthetic datasets. We also present novel verification algorithms for regression coefficients that satisfy differential privacy. We illustrate the integrated use of synthetic data plus verification via analysis of differentials in pay by race. The integrated system performs as intended, allowing users to explore the synthetic data for potential pay differentials and learn through verifications which findings in the synthetic data hold up and which do not. The analysis on the confidential data reveals pay differentials across races not documented in published studies.
Recommendations
- Releasing multiply-imputed synthetic data generated in two stages to protect confidentiality
- Releasing Multiply Imputed, Synthetic Public use Microdata: An Illustration and Empirical Study
- New data dissemination approaches in old Europe -- synthetic datasets for a German establishment survey
- Verification servers: enabling analysts to assess the quality of inferences from public use data
- A new approach for disclosure control in the IAB establishment panel -- multiple imputation for a better data access
Cites work
- scientific article; zbMATH DE number 5485439
- scientific article; zbMATH DE number 1834445
- scientific article; zbMATH DE number 5485574
- A new approach for disclosure control in the IAB establishment panel -- multiple imputation for a better data access
- Bayesian multiscale multiple imputation with implications for data confidentiality
- Differential Privacy
- Inference using noisy degrees: differentially private \(\beta\)-model and synthetic graphs
- Multiple imputation for sharing precise geographies in public use data
- Releasing Multiply Imputed, Synthetic Public use Microdata: An Illustration and Empirical Study
- Synthetic datasets for statistical disclosure control. Theory and implementation
- The Multiple Adaptations of Multiple Imputation
- The algorithmic foundations of differential privacy
- Verification servers: enabling analysts to assess the quality of inferences from public use data
- Wage dispersion, returns to skill, and black-white wage differentials
Cited in (5)
- A Feasibility Study of Differentially Private Summary Statistics and Regression Analyses with Evaluations on Administrative and Survey Data
- Bayesian bootstraps for massive data
- 30 years of synthetic data
- Reproducibility and transparency versus privacy and confidentiality: reflections from a data editor
- Comparative study of differentially private data synthesis methods