Greedy clustering of count data through a mixture of multinomial PCA
From MaRDI portal
Abstract: Count data is becoming more and more ubiquitous in a wide range of applications, with datasets growing both in size and in dimension. In this context, an increasing amount of work is dedicated to the construction of statistical models directly accounting for the discrete nature of the data. Moreover, it has been shown that integrating dimension reduction to clustering can drastically improve performance and stability. In this paper, we rely on the mixture of multinomial PCA, a mixture model for the clustering of count data, also known as the probabilistic clustering-projection model in the literature. Related to the latent Dirichlet allocation model, it offers the flexibility of topic modeling while being able to assign each observation to a unique cluster. We introduce a greedy clustering algorithm, where inference and clustering are jointly done by mixing a classification variational expectation maximization algorithm, with a branch & bound like strategy on a variational lower bound. An integrated classification likelihood criterion is derived for model selection, and a thorough study with numerical experiments is proposed to assess both the performance and robustness of the method. Finally, we illustrate the qualitative interest of the latter in a real-world application, for the clustering of anatomopathological medical reports, in partnership with expert practitioners from the Institut Curie hospital.
Recommendations
- Clustering multivariate count data via Dirichlet-multinomial network fusion
- Clustering discrete data through the multinomial mixture model
- scientific article; zbMATH DE number 1805764
- scientific article; zbMATH DE number 2231117
- A parametric mixture model for clustering multivariate binary data
- High-dimensional count data clustering based on an exponential approximation to the multinomial beta-Liouville distribution
- Clustering of contingency table and mixture model
Cites work
- scientific article; zbMATH DE number 3567782
- scientific article; zbMATH DE number 3579840
- scientific article; zbMATH DE number 1931821
- 10.1162/jmlr.2003.3.4-5.993
- A classification EM algorithm for clustering and two stochastic versions
- Estimating the dimension of a model
- Finite mixture models
- High-dimensional data clustering
- Learning the parts of objects by non-negative matrix factorization
- Model-Based Gaussian and Non-Gaussian Clustering
- Model-based clustering and classification for data science. With applications in R
- On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing
- Probabilistic Principal Component Analysis
- The latent topic block model for the co-clustering of textual interaction data
- The stochastic topic block model for the clustering of vertices in networks with textual edges
- Variational inference for probabilistic Poisson PCA
Cited in (5)
- Clustering multivariate count data via Dirichlet-multinomial network fusion
- scientific article; zbMATH DE number 1833999
- Embedded topics in the stochastic block model
- High-dimensional count data clustering based on an exponential approximation to the multinomial beta-Liouville distribution
- MoMPCA
MaRDI item Q131099