Affiliation:
1. HSE University, Moscow, Russia
Abstract
The steady growth in the popularity of social media as a means of communication raises methodological issues in processing short texts, which carry less semantic context than the large corpora widely used for training and testing machine learning models for textual data. Topic modeling, an unsupervised machine learning technique that groups texts into topic clusters, has many academic and practical applications where information on the true groupings of texts is not available. However, the performance of topic modeling algorithms may be limited by the need for sufficient semantic context to build a high-quality numerical representation of a unit of text, and such context may not be effectively derived from a short document. This paper discusses six approaches to topic modeling, compares their performance on a set of Russian-language TikTok comments, and formally evaluates them on speed and on the coherence of the resulting topics.
Publisher
Federal Center of Theoretical and Applied Sociology of the Russian Academy of Sciences (FCTAS RAS)