Clustering and variable selection for categorical multivariate data
Abstract: This article investigates unsupervised classification techniques for categorical multivariate data. The study employs multivariate multinomial mixture modeling, which is a type of model particularly applicable to multilocus genotypic data. A model selection procedure is used to simultaneously select the number of components and the relevant variables. A non-asymptotic oracle inequality is obtained, leading to the proposal of a new penalized maximum likelihood criterion. The selected model proves to be asymptotically consistent under weak assumptions on the true probability underlying the observations. The main theoretical result obtained in this study suggests a penalty function defined to within a multiplicative parameter. In practice, the data-driven calibration of the penalty function is made possible by slope heuristics. Based on simulated data, this procedure is found to improve the performance of the selection procedure with respect to classical criteria such as BIC and AIC. The new criterion provides an answer to the question "Which criterion for which sample size?" Examples of real dataset applications are also provided.
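The abstract describes a penalized maximum likelihood criterion whose penalty is known only up to a multiplicative constant, with that constant calibrated from the data by slope heuristics. The following Python sketch illustrates the general idea and is not the authors' implementation. It makes several assumptions that the abstract does not state: the penalty is taken to be proportional to the model dimension divided by the sample size, the dimension formula is the usual free-parameter count for a multinomial mixture in which only the selected variables depend on the cluster, and the constant is estimated by a simple linear fit over the most complex models. The function names `model_dimension` and `slope_heuristic_select` and the parameter `frac` are hypothetical.

```python
import numpy as np


def model_dimension(n_components, n_levels, selected):
    """Count free parameters of a multivariate multinomial mixture.

    Assumed parameterisation (not taken from the abstract): variables in
    `selected` have one multinomial distribution per component, all other
    variables share a single distribution across components.
    """
    selected = set(selected)
    mixing = n_components - 1
    clustering = n_components * sum(n_levels[j] - 1 for j in selected)
    irrelevant = sum(
        n_levels[j] - 1 for j in range(len(n_levels)) if j not in selected
    )
    return mixing + clustering + irrelevant


def slope_heuristic_select(dims, logliks, n, frac=0.5):
    """Pick a model with a penalty calibrated by a slope estimate.

    dims    : model dimensions D_m
    logliks : maximized log-likelihoods of the same models
    n       : sample size
    frac    : share of the most complex models used for the linear fit
              (a tuning choice made for this sketch)
    Returns (index of the selected model, calibrated penalty constant).
    """
    dims = np.asarray(dims, dtype=float)
    logliks = np.asarray(logliks, dtype=float)

    # For complex models the bias is assumed negligible, so the normalized
    # log-likelihood should grow roughly linearly in D_m / n.
    order = np.argsort(dims)
    k = max(2, int(np.ceil(frac * len(dims))))
    big = order[-k:]
    slope, _intercept = np.polyfit(dims[big] / n, logliks[big] / n, 1)

    # Slope heuristics: the penalty constant is twice the estimated
    # minimal-penalty slope.
    lam = 2.0 * slope
    crit = -logliks / n + lam * dims / n
    return int(np.argmin(crit)), float(lam)
```

In use, one would fit every candidate pair (number of components, subset of variables), compute each model's dimension with `model_dimension`, and pass the dimensions and maximized log-likelihoods to `slope_heuristic_select`. The linear fit is only meaningful when the collection contains enough models that are clearly too complex.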
Recommendations
- Variable selection in model-based clustering using multilocus genotype data
- Variable selection methods for model-based clustering
- Variable selection in clustering via Dirichlet process mixture models
- Penalized model-based clustering with application to variable selection
- Variable Selection for Clustering with Gaussian Mixture Models
Cites work
- scientific article; zbMATH DE number 3567782
- A non asymptotic penalized criterion for Gaussian mixture model selection
- Clustering criteria for discrete data and latent class models
- Clustering for binary data and mixture models—choice of the model
- Computational and Inferential Difficulties with Mixture Posterior Distributions
- Concentration inequalities and model selection. Ecole d'Eté de Probabilités de Saint-Flour XXXIII -- 2003.
- Data-driven penalty calibration: a case study for Gaussian mixture model selection
- Exploratory latent structure analysis using both identifiable and unidentifiable models
- Finite mixture models
- Minimal penalties for Gaussian model selection
- Model selection with data-oriented penalty
- Rates of convergence for the Gaussian mixture sieve.
- Variable selection in model-based clustering using multilocus genotype data
Cited in (6)
- Variable selection methods for model-based clustering
- Nonparametric finite translation hidden Markov models and extensions
- Variable selection for mixed data clustering: application in human population genomics
- Selection of Variables for Cluster Analysis and Classification Rules
- A hierarchical Bayesian approach for examining heterogeneity in choice decisions
- Efficient mixture model for clustering of sparse high dimensional binary data