Abstract
This chapter focuses on topic modelling, i.e. the automatic extraction of topics or themes from a corpus. Topic modelling goes a step further than keyword extraction in automatically identifying the contents of a corpus. Two types of approaches are considered, discussed, and contrasted: on the one hand, those that I dub “traditional”, as illustrated by the LDA and NMF algorithms, and, on the other, embeddings-based approaches, which largely surpass the former in the quality of their results. The weakest aspect of topic modelling tools in general is the lack of actual labels for the extracted topics, since all they return is a set of loosely related keywords that collectively identify the topic. In the last experiment I describe an approach that uses the power of Large Language Models to effectively derive high-quality labels for the extracted topics.
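As a point of reference for the “traditional” approaches the abstract mentions, the sketch below (not the chapter's own code; a minimal illustration assuming scikit-learn and a toy corpus) fits an LDA model and prints each topic as its top keywords, i.e. exactly the kind of unlabeled keyword set the chapter identifies as the weak point of these tools.

```python
# Minimal sketch of "traditional" topic modelling with scikit-learn LDA.
# The corpus and the number of topics are arbitrary, for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the court ruled on the appeal and the judge issued a sentence",
    "the striker scored twice and the team won the league match",
    "the vaccine trial reported strong immune response in patients",
    "the defender was injured during the second half of the match",
    "the hospital treated patients with the new antiviral drug",
    "the lawyer filed a motion before the appellate court",
]

# Bag-of-words representation; stop words are removed so that the topic
# keywords are content words rather than function words.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(corpus)

# Fit LDA with a small, arbitrary number of topics.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(doc_term)

# Each topic comes back as its highest-weighted words: a set of loosely
# related keywords that still needs a human-readable label.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")
```

The printed keyword lists make the labelling gap concrete: a reader can usually guess the theme, but the model itself never names it, which is the step the chapter's final experiment delegates to a Large Language Model.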
Publisher
Springer Nature Switzerland