Correcting classifiers for sample selection bias in two-phase case-control studies (Q1784139): Difference between revisions

Summary: Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when the resulting classifiers are applied to nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear, especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods that aim to reproduce the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest benefits only from the parametric inverse-probability bagging that we propose. For other classifiers, correction is mostly advantageous, and the methods perform similarly to one another. We discuss the consequences of inappropriate distribution assumptions and give reasons for the different behavior of the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if the distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia.
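As a rough illustration of the general idea behind inverse-probability resampling mentioned in the summary, the following Python sketch redraws a stratified sample with weights proportional to the inverse of each row's selection probability. It is only a generic sketch under stated assumptions: it is neither the stochastic inverse-probability oversampling nor the parametric inverse-probability bagging proposed in the paper, and it does not reflect the interface of the R package sambia. The function name, the toy data, and the selection probabilities are all hypothetical.

```python
import random
from collections import Counter


def inverse_probability_resample(rows, selection_prob, n_draws, seed=0):
    """Draw n_draws rows with replacement, weighting each row by
    1 / selection_prob(row), so that under-sampled strata are drawn more often."""
    rng = random.Random(seed)
    weights = [1.0 / selection_prob(row) for row in rows]
    return rng.choices(rows, weights=weights, k=n_draws)


# Hypothetical toy sample: every case was kept (selection probability 1.0),
# while controls were subsampled with probability 0.1, so cases are enriched.
sample = [{"y": 1, "x": i} for i in range(50)] + [{"y": 0, "x": i} for i in range(50)]
pi = {1: 1.0, 0: 0.1}  # assumed known selection probabilities per outcome stratum

resampled = inverse_probability_resample(sample, lambda r: pi[r["y"]], n_draws=10_000)
case_share = Counter(r["y"] for r in resampled)[1] / len(resampled)
print(f"case share in biased sample: 0.500, after resampling: {case_share:.3f}")
```

In this toy setup the cases make up half of the biased sample, whereas the weights imply a case share of about 50 / 550 in the resampled data, which is closer to what a nonstratified sample would contain. A classifier would then be trained on the resampled rows rather than on the enriched sample.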

Mathematics Subject Classification ID

92B15

sample selection bias

epidemiology

two-phase case-control studies

classifiers

Wikidata QID

Q47193912

describes a project that uses

MaRDI publication profile

full work available at URL

https://doi.org/10.1155/2017/7847531

OpenAlex ID

W2757523618

cites work

Two-Stage Designs for Gene-Disease Association Studies with Sample Size Constraints

Secondary analysis under cohort sampling designs using conditional likelihood

Sample Selection Bias as a Specification Error

Sample Selection Bias Correction Theory

Using Sample Survey Weights in Multiple Regression Analyses of Stratified Samples

A Generalization of Sampling Without Replacement From a Finite Universe

Estimation of Regression Coefficients When Some Regressors Are Not Always Observed

Bagging predictors

Q4533353

Regression. Models, methods and applications.

Random forests

The elements of statistical learning. Data mining, inference, and prediction

Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach

Sitelinks

Mathematics (1 entry)

mardi Publication:1784139

@@ Property / describes a project that uses @@
+OpenML
@@ Property / describes a project that uses: OpenML / rank @@
+Normal rank
@@ Property / describes a project that uses @@
+SMOTE
@@ Property / describes a project that uses: SMOTE / rank @@
+Normal rank
@@ Property / describes a project that uses @@
+pROC
@@ Property / describes a project that uses: pROC / rank @@
+Normal rank
@@ Property / MaRDI profile type @@
+MaRDI publication profile
@@ Property / MaRDI profile type: MaRDI publication profile / rank @@
+Normal rank
@@ Property / full work available at URL @@
+https://doi.org/10.1155/2017/7847531
+Normal rank
@@ Property / OpenAlex ID @@
+W2757523618
@@ Property / OpenAlex ID: W2757523618 / rank @@
+Normal rank
@@ Property / cites work @@
+Two-Stage Designs for Gene-Disease Association Studies with Sample Size Constraints
+Normal rank
@@ Property / cites work @@
+Secondary analysis under cohort sampling designs using conditional likelihood
+Normal rank
@@ Property / cites work @@
+Sample Selection Bias as a Specification Error
@@ Property / cites work: Sample Selection Bias as a Specification Error / rank @@
+Normal rank
@@ Property / cites work @@
+Sample Selection Bias Correction Theory
@@ Property / cites work: Sample Selection Bias Correction Theory / rank @@
+Normal rank
@@ Property / cites work @@
+Using Sample Survey Weights in Multiple Regression Analyses of Stratified Samples
+Normal rank
@@ Property / cites work @@
+A Generalization of Sampling Without Replacement From a Finite Universe
+Normal rank
@@ Property / cites work @@
+Estimation of Regression Coefficients When Some Regressors Are Not Always Observed
+Normal rank
@@ Property / cites work @@
+Bagging predictors
@@ Property / cites work: Bagging predictors / rank @@
+Normal rank
@@ Property / cites work @@
+Q4533353
@@ Property / cites work: Q4533353 / rank @@
+Normal rank
@@ Property / cites work @@
+Regression. Models, methods and applications.
@@ Property / cites work: Regression. Models, methods and applications. / rank @@
+Normal rank
@@ Property / cites work @@
+Random forests
@@ Property / cites work: Random forests / rank @@
+Normal rank
@@ Property / cites work @@
+The elements of statistical learning. Data mining, inference, and prediction
+Normal rank
@@ Property / cites work @@
+Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach
+Normal rank