Affiliation:
1. Jiangxi University of Finance and Economics, China and East China Jiaotong University, China and Jiangxi Key Laboratory of Data and Knowledge Engineering, China
2. Jiangxi University of Finance and Economics, China and Jiangxi Key Laboratory of Data and Knowledge Engineering, China
3. College of Business, Stony Brook University, USA
Abstract
Learning topic hierarchies from a multi-domain corpus is crucial in topic modeling because it reveals valuable structural information embedded within documents. Despite the extensive literature on hierarchical topic models, effectively discovering the correlations and differences among subtopics that sit at the same level of the hierarchy but originate from multiple domains remains an unresolved challenge. This article proposes an enhanced nested Chinese restaurant process (nCRP), called nCRP+, which adds to the original nCRP a mechanism based on the Chinese restaurant franchise (CRF) for extracting aspect-sharing patterns. Then, using the distribution extracted from nCRP+ as the prior over topic hierarchies in the hierarchical Dirichlet process (HDP), we develop a hierarchical topic model for multi-domain corpora, named rHDP. We describe the model through the analogy of a Chinese restaurant franchise based on a central kitchen and propose a hierarchical Gibbs sampling scheme for inference. Our method constructs topic hierarchies that reflect diverse parent-child topic relationships, explicit aspect-sharing correlations between topics, and the differences between these shared topics. To validate the approach, we conduct experiments on a well-known public dataset and an online collection of Chinese financial documents. The experimental results show that our method outperforms state-of-the-art techniques in identifying multi-domain topic hierarchies according to multiple evaluation metrics.
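For readers unfamiliar with the nested Chinese restaurant process that the abstract builds on, the following is a minimal sketch of how the standard nCRP prior assigns each document a root-to-leaf path in a topic tree. It illustrates only the generic nCRP, not the nCRP+ extension or the rHDP model, whose details are not given here. The names `Node`, `crp_choice`, `sample_path`, and the concentration parameter `gamma` are assumptions made for illustration.

```python
import random


class Node:
    """A topic node in the tree; count = number of documents routed through it."""

    def __init__(self):
        self.count = 0
        self.children = []


def crp_choice(counts, gamma, rng):
    """Return an existing index with probability proportional to its count,
    or len(counts) (meaning 'open a new table') with probability
    proportional to gamma."""
    total = sum(counts) + gamma
    r = rng.random() * total
    acc = 0.0
    for i, c in enumerate(counts):
        acc += c
        if r < acc:
            return i
    return len(counts)


def sample_path(root, depth, gamma, rng):
    """Draw one root-to-leaf path of the given depth, updating counts in place."""
    root.count += 1
    path = [root]
    node = root
    for _ in range(depth - 1):
        k = crp_choice([child.count for child in node.children], gamma, rng)
        if k == len(node.children):  # a new child topic is created
            node.children.append(Node())
        node = node.children[k]
        node.count += 1
        path.append(node)
    return path


rng = random.Random(0)
root = Node()
paths = [sample_path(root, depth=3, gamma=1.0, rng=rng) for _ in range(50)]
```

Popular branches attract more documents while `gamma` controls how often new subtopics appear, so the tree grows with the data instead of having a fixed shape.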
Funder
National Natural Science Foundation of China
Natural Science Foundation of Jiangxi Province
Funding Program for Academic and Technical Leaders in Major Disciplines of Jiangxi Province
Research Project for Science and Technology of Jiangxi Education Department
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications; General Business, Management and Accounting; Information Systems