Correcting classifiers for sample selection bias in two-phase case-control studies (Q1784139): Difference between revisions

From MaRDI portal
Property / describes a project that uses: OpenML
Property / describes a project that uses: SMOTE
Property / describes a project that uses: pROC
Property / MaRDI profile type: MaRDI publication profile
Property / full work available at URL: https://doi.org/10.1155/2017/7847531
Property / OpenAlex ID: W2757523618
Property / cites work: Two-Stage Designs for Gene-Disease Association Studies with Sample Size Constraints
Property / cites work: Secondary analysis under cohort sampling designs using conditional likelihood
Property / cites work: Sample Selection Bias as a Specification Error
Property / cites work: Sample Selection Bias Correction Theory
Property / cites work: Using Sample Survey Weights in Multiple Regression Analyses of Stratified Samples
Property / cites work: A Generalization of Sampling Without Replacement From a Finite Universe
Property / cites work: Estimation of Regression Coefficients When Some Regressors Are Not Always Observed
Property / cites work: Bagging predictors
Property / cites work: Q4533353
Property / cites work: Regression. Models, methods and applications.
Property / cites work: Random forests
Property / cites work: The elements of statistical learning. Data mining, inference, and prediction
Property / cites work: Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach

Latest revision as of 17:07, 16 July 2024

scientific article

    Statements

    Correcting classifiers for sample selection bias in two-phase case-control studies (English)
    26 September 2018
    Summary: Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when classifiers are applied to nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear, especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods intended to resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits only from the parametric inverse-probability bagging that we propose. For other classifiers, correction is mostly advantageous, and methods perform uniformly. We discuss the consequences of inappropriate distribution assumptions and give reasons for the different behaviors of the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia.
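    The general idea behind inverse-probability corrections can be illustrated with a minimal Python sketch. It is not the implementation from the sambia package and it is neither of the two methods proposed in the article; it only shows plain inverse-probability resampling, in which each observation is redrawn with a weight of one over its selection probability. The function name, the stratum labels, the selection probabilities, and the toy data are all hypothetical.

```python
import random

def inverse_probability_resample(rows, selection_prob, n_out, seed=0):
    """Draw n_out rows with replacement, weighting each row by
    1 / (probability that its stratum was selected into the sample).

    rows: list of (stratum, features, label) tuples from the biased sample.
    selection_prob: dict mapping stratum -> inclusion probability (assumed known).
    """
    rng = random.Random(seed)
    weights = [1.0 / selection_prob[stratum] for stratum, _, _ in rows]
    return rng.choices(rows, weights=weights, k=n_out)

# Hypothetical toy data: cases (label 1) were enriched, so they entered the
# sample with probability 0.9, while controls entered with probability 0.1.
selection_prob = {"case": 0.9, "control": 0.1}
biased_sample = (
    [("case", [i], 1) for i in range(90)]
    + [("control", [i], 0) for i in range(10)]
)

# The biased sample is 90% cases; the weights imply 90 / 0.9 = 100 cases and
# 10 / 0.1 = 100 controls in the source population, so the resampled data
# should contain cases and controls in roughly equal shares.
resampled = inverse_probability_resample(biased_sample, selection_prob, n_out=10_000)
case_share = sum(label for _, _, label in resampled) / len(resampled)
```

    A classifier trained on `resampled` instead of `biased_sample` would see class proportions closer to those of the nonstratified population, which is the distortion the corrections discussed in the summary are meant to address.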
    sample selection bias
    epidemiology
    two-phase case-control studies
    classifiers