A clustering-based topic model using word networks and word embeddings


Mu Wenchuan, Lim Kwan Hui, Liu Junhua, Karunasekera Shanika, Falzon Lucia, Harwood Aaron


Abstract

Online social networking services like Twitter are frequently used for discussions on numerous topics of interest, ranging from mainstream and popular topics (e.g., music and movies) to niche and specialized topics (e.g., politics). Given the popularity of such services and the large volume of tweets, automatically modelling and determining the numerous discussion topics is a challenging task. Adding to this complexity, the topics must be identified without prior knowledge of either their types or their number, and existing models require technical expertise to tune their numerous parameters. To address this challenge, we develop the Clustering-based Topic Modelling (ClusTop) algorithm, which first constructs different types of word networks based on different types of n-gram co-occurrence and word embedding distances. Using these word networks, ClusTop then automatically determines the discussion topics using community detection approaches. In contrast to traditional topic models, ClusTop does not require the tuning or setting of numerous parameters and instead uses community detection approaches to automatically determine the appropriate number of topics. ClusTop also captures the syntactic meaning in tweets by using bigrams, trigrams, other word combinations and word embedding techniques to construct the word network graph, with edge weights based on word embeddings. Using three Twitter datasets with labelled crises and events as topics, we show that ClusTop outperforms various traditional baselines in terms of topic coherence, pointwise mutual information, precision, recall and F-score.
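The pipeline described in the abstract (build a word network from n-gram co-occurrence, then read topics off its communities) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `build_bigram_network` and `label_propagation` are hypothetical, edge weights are plain bigram counts rather than word embedding distances, and weighted label propagation stands in for whichever community detection method ClusTop actually uses. It is not the authors' implementation.

```python
from collections import Counter, defaultdict


def build_bigram_network(docs):
    """Build an undirected word network (hypothetical helper).

    Nodes are words; an edge weight counts how often two words
    appear next to each other (bigram co-occurrence) across docs.
    """
    weights = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for a, b in zip(tokens, tokens[1:]):
            if a != b:  # skip self-loops such as "very very"
                weights[frozenset((a, b))] += 1

    graph = defaultdict(dict)
    for pair, w in weights.items():
        a, b = tuple(pair)
        graph[a][b] = w
        graph[b][a] = w
    return graph


def label_propagation(graph, max_iters=50):
    """Weighted label propagation, used here as a simple stand-in
    for a community detection step (an assumption, not ClusTop's method).

    Each word repeatedly adopts the label with the largest total edge
    weight among its neighbours; ties go to the alphabetically smallest
    label so that the result is deterministic.
    """
    labels = {node: node for node in graph}
    for _ in range(max_iters):
        changed = False
        for node in sorted(graph):
            scores = Counter()
            for neighbour, w in graph[node].items():
                scores[labels[neighbour]] += w
            best = max(scores.values())
            new_label = min(l for l, s in scores.items() if s == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break

    communities = defaultdict(set)
    for node, label in labels.items():
        communities[label].add(node)
    # Each community is treated as one topic; the number of topics
    # is whatever the detection step produces, not a preset parameter.
    return sorted(communities.values(), key=lambda c: sorted(c))


# Toy "tweets" covering two unrelated events.
docs = [
    "earthquake hits city",
    "city earthquake rescue",
    "election vote count",
    "vote election result",
]
topics = label_propagation(build_bigram_network(docs))
for topic in topics:
    print(sorted(topic))
# ['city', 'earthquake', 'hits', 'rescue']
# ['count', 'election', 'result', 'vote']
```

The point of the sketch is the one the abstract makes: the number of topics is not supplied anywhere and falls out of the community structure of the word network. A fuller version would presumably also use trigrams and embedding-based edge weights, which are omitted here.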


Defence Science and Technology Group

Singapore University of Technology and Design


Springer Science and Business Media LLC


Information Systems and Management, Computer Networks and Communications, Hardware and Architecture, Information Systems


Cited by 7 articles.







