Affiliation:
1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
2. Xinjiang Multilingual Information Technology Laboratory, College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Abstract
Topic models can extract consistent themes from large corpora for research purposes. In recent years, the combination of pretrained language models and neural topic models has gained attention among scholars. However, this approach has some drawbacks: in short texts, the quality of the topics obtained by the models is low and incoherent, which is caused by the reduced word frequency (insufficient word co-occurrence) in short texts compared to long texts. To address these issues, we propose a neural topic model based on SBERT and data augmentation. First, our proposed easy data augmentation (EDA) method with keyword combination helps overcome the sparsity problem in short texts. Then, the attention mechanism is used to focus on keywords related to the topic and reduce the impact of noise words. Next, the SBERT model is trained on a large and diverse dataset, which can generate high-quality semantic information vectors for short texts. Finally, we perform feature fusion on the augmented data that have been weighted by an attention mechanism with the high-quality semantic information obtained. Then, the fused features are input into a neural topic model to obtain high-quality topics. The experimental results on an English public dataset show that our model generates high-quality topics, with the average scores improving by 2.5% for topic coherence and 1.2% for topic diversity compared to the baseline model.
Funder
Key Projects of Scientific Research Program of Xinjiang Universities Foundation of China
Subject
Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science
Reference48 articles.
1. Latent dirichlet allocation;Blei;J. Mach. Learn. Res.,2003
2. Chaudhary, Y., Gupta, P., Saxena, K., Kulkarni, V., Runkler, T., and Schütze, H. (2020). TopicBERT for energy efficient document classification. arXiv.
3. Sentiment word co-occurrence and knowledge pair feature extraction based LDA short text clustering algorithm;Wu;J. Intell. Inf. Syst.,2021
4. Indexing by latent semantic analysis;Deerwester;J. Am. Soc. Inf. Sci.,1990
5. Hofmann, T. (1999, January 15–19). Probabilistic Latent Semantic Indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, USA.
Cited by
6 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献