Deep mixtures of unigrams for uncovering topics in textual data
From MaRDI portal
Abstract: Mixtures of Unigrams are one of the simplest and most efficient tools for clustering textual data, as they assume that documents related to the same topic have similar distributions of terms, naturally described by Multinomials. When the classification task is particularly challenging, such as when the document-term matrix is high-dimensional and extremely sparse, a more composite representation can provide better insight into the grouping structure. In this work, we develop a deep version of mixtures of Unigrams for the unsupervised classification of very short documents with a large number of terms, by allowing for deeper latent layers; the proposal is derived in a Bayesian framework. The behaviour of the Deep Mixtures of Unigrams is empirically compared with that of other traditional and state-of-the-art methods, namely k-means with cosine distance, k-means with Euclidean distance on data transformed according to Latent Semantic Analysis, Partitioning Around Medoids, Mixture of Gaussians on semantic-based transformed data, hierarchical clustering according to Ward's method with cosine dissimilarity, Latent Dirichlet Allocation, Mixtures of Unigrams estimated via the EM algorithm, Spectral Clustering and Affinity Propagation clustering. The performance is evaluated in terms of both correct classification rate and Adjusted Rand Index. Simulation studies and real data analysis show that going deep in clustering such data substantially improves the classification accuracy.
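As a point of reference for the baseline the paper builds on, the following is a minimal sketch of EM for a (shallow) mixture of unigrams, where each cluster is a multinomial over the vocabulary. The function name, the smoothing constant `alpha`, and the random Dirichlet initialisation are illustrative choices, not details taken from the paper; the deep Bayesian extension proposed in the article is not reproduced here.

```python
import numpy as np

def mixture_of_unigrams_em(X, K, n_iter=100, alpha=1e-2, seed=0):
    """EM for a mixture of unigrams.

    X     : (docs x terms) matrix of term counts
    K     : number of clusters (topics)
    alpha : small additive smoothing for term probabilities (an assumption)
    Returns mixing weights, cluster-term distributions, and hard labels.
    """
    rng = np.random.default_rng(seed)
    D, V = X.shape
    pi = np.full(K, 1.0 / K)                  # mixing weights
    beta = rng.dirichlet(np.ones(V), size=K)  # per-cluster term distributions
    for _ in range(n_iter):
        # E-step: responsibilities in log space (important for sparse counts)
        log_r = np.log(pi) + X @ np.log(beta).T        # shape (D, K)
        log_r -= log_r.max(axis=1, keepdims=True)      # stabilise before exp
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and smoothed term distributions
        pi = r.mean(axis=0)
        beta = r.T @ X + alpha
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta, r.argmax(axis=1)
```

On well-separated data the hard labels recover the grouping; the deep variant studied in the paper replaces this single latent layer with a hierarchy of latent layers estimated in a Bayesian framework.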
Recommendations
- Mixtures of Dirichlet-multinomial distributions for supervised and unsupervised classification of short text data
- Two-way Poisson mixture models for simultaneous document classification and word clustering
- Cluster-based sparse topical coding for topic mining and document clustering
- scientific article; zbMATH DE number 2086540
- scientific article; zbMATH DE number 1302160
Cites work
- scientific article; zbMATH DE number 3567782
- Latent Dirichlet allocation (DOI: 10.1162/jmlr.2003.3.4-5.993)
- Adjusting for chance clustering comparison measures
- Clustering by passing messages between data points
- Deep Gaussian mixture models
- Dimensionally reduced model-based clustering through mixtures of factor mixture analyzers
- Finding Groups in Data
- Finite mixture models
- Heteroscedastic factor mixture analysis
- Least squares quantization in PCM
- Model-Based Clustering, Discriminant Analysis, and Density Estimation
- Modelling high-dimensional data by mixtures of factor analyzers
- Text classification from labeled and unlabeled documents using EM
- Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion?
Cited in (5)
- Study on text representation method based on deep learning and topic information
- Dealing with overdispersion in multivariate count data
- Mixtures of Dirichlet-multinomial distributions for supervised and unsupervised classification of short text data
- A mixture model approach to spectral clustering and application to textual data
- deepMOU
This page was built for publication: Deep mixtures of unigrams for uncovering topics in textual data