Using SVD for Topic Modeling
Abstract: The probabilistic topic model imposes a low-rank structure on the expectation of the corpus matrix. Therefore, singular value decomposition (SVD) is a natural tool for dimension reduction. We propose an SVD-based method for estimating a topic model. Our method constructs an estimate of the topic matrix from only a few leading singular vectors of the corpus matrix, and has a great advantage in memory use and computational cost for large-scale corpora. The core ideas behind our method include a pre-SVD normalization to tackle severe word frequency heterogeneity, a post-SVD normalization to create a low-dimensional word embedding that manifests a simplex geometry, and a post-SVD procedure to construct an estimate of the topic matrix directly from the embedded word cloud. We provide the explicit rate of convergence of our method. We show that our method attains the optimal rate in the case of long and moderately long documents, and that it improves on the rates of existing methods in the case of short documents. The key to our analysis is a sharp row-wise large-deviation bound for empirical singular vectors, which is technically demanding to derive and potentially useful for other problems. We apply our method to a corpus of Associated Press news articles and a corpus of abstracts of statistical papers.
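The three steps named in the abstract (pre-SVD normalization, post-SVD normalization, and estimation from the embedded word cloud) can be illustrated with a minimal NumPy sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the function names `svd_topic_sketch` and `successive_projection` are hypothetical, the pre-SVD step is taken to be a rescaling of each word by the square root of its mean frequency, the post-SVD step is taken to be entrywise division by the first singular vector, and a simple successive-projection routine stands in for whatever vertex-finding procedure the paper actually uses.

```python
import numpy as np


def successive_projection(R, K):
    """Pick K rows of R that look like simplex vertices (stand-in vertex search)."""
    # Augment with a constant column so the K vertices are linearly independent.
    X = np.hstack([np.ones((R.shape[0], 1)), R]).astype(float)
    idx = []
    for _ in range(K):
        j = int(np.argmax((X ** 2).sum(axis=1)))  # farthest remaining point
        idx.append(j)
        u = X[j] / np.linalg.norm(X[j])
        X = X - np.outer(X @ u, u)  # project out the chosen direction
    return idx


def svd_topic_sketch(D, K):
    """Estimate a (p x K) topic matrix from a (p words x n docs) frequency matrix D."""
    p, _ = D.shape
    # Assumed pre-SVD normalization: rescale each word by sqrt of its mean frequency.
    m = np.maximum(D.mean(axis=1), 1e-12)
    D_norm = D / np.sqrt(m)[:, None]
    # Keep only the K leading left singular vectors.
    U, _, _ = np.linalg.svd(D_norm, full_matrices=False)
    Xi = U[:, :K].copy()
    if Xi[:, 0].sum() < 0:  # fix the sign so the first vector is positive
        Xi[:, 0] = -Xi[:, 0]
    xi1 = np.maximum(Xi[:, 0], 1e-12)
    # Assumed post-SVD normalization: ratios against the first singular vector
    # give a (K-1)-dimensional word embedding.
    R = Xi[:, 1:] / xi1[:, None]
    # Locate K vertices of the embedded word cloud.
    V = R[successive_projection(R, K)]
    # Barycentric coordinates of every word with respect to the vertices.
    V_aug = np.hstack([V, np.ones((K, 1))])
    R_aug = np.hstack([R, np.ones((p, 1))])
    W = np.linalg.solve(V_aug.T, R_aug.T).T
    W = np.maximum(W, 0.0)
    W /= W.sum(axis=1, keepdims=True)
    # Undo both normalizations, then make each topic a probability vector.
    A_hat = np.sqrt(m)[:, None] * xi1[:, None] * W
    return A_hat / A_hat.sum(axis=0, keepdims=True)
```

In this sketch the columns of the returned matrix are word distributions for the K topics, identifiable only up to a permutation of the columns. Only a rank-K truncation of the SVD is needed, so for a large corpus a truncated sparse solver could replace the dense `np.linalg.svd` call; that substitution is likewise an assumption, not something stated in the abstract.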