Affiliation:
1. Department of Statistics, Harvard University, Cambridge, MA 02138, USA
Abstract
Topic modeling is a widely used tool in text analysis. We investigate the optimal rate for estimating a topic model. Specifically, we consider a scenario with n documents, a vocabulary of size p, and document lengths of order N. When N≥c·p, referred to as the long-document case, the literature establishes the optimal rate as p/(Nn). However, when N=o(p), referred to as the short-document case, the optimal rate remains unknown. In this paper, we first provide new entry-wise large-deviation bounds for the empirical singular vectors of a topic model. We then apply these bounds to improve the error rate of a spectral algorithm, Topic-SCORE. Finally, by comparing the improved error rate with the minimax lower bound, we conclude that the optimal rate is still p/(Nn) in the short-document case.
Funder
National Science Foundation