rHDP: An Aspect Sharing-Enhanced Hierarchical Topic Model for Multi-Domain Corpus

Author:

Zhang Yitao (1), Wan Changxuan (2), Xiao Keli (3), Wan Qizhi (2), Liu Dexi (2), Liu Xiping (2)

Affiliation:

1. Jiangxi University of Finance and Economics, China; East China Jiaotong University, China; Jiangxi Key Laboratory of Data and Knowledge Engineering, China

2. Jiangxi University of Finance and Economics, China; Jiangxi Key Laboratory of Data and Knowledge Engineering, China

3. College of Business, Stony Brook University, USA

Abstract

Learning topic hierarchies from a multi-domain corpus is crucial in topic modeling as it reveals valuable structural information embedded within documents. Despite the extensive literature on hierarchical topic models, effectively discovering inter-topic correlations and differences among subtopics at the same level in the topic hierarchy, obtained from multiple domains, remains an unresolved challenge. This article proposes an enhanced nested Chinese restaurant process (nCRP), nCRP+, by introducing an additional mechanism based on Chinese restaurant franchise (CRF) for aspect-sharing pattern extraction in the original nCRP. Subsequently, by employing the distribution extracted from nCRP+ as the prior distribution for topic hierarchy in the hierarchical Dirichlet processes (HDP), we develop a hierarchical topic model for multi-domain corpus, named rHDP. We describe the model with the analogy of Chinese restaurant franchise based on the central kitchen and propose a hierarchical Gibbs sampling scheme to infer the model. Our method effectively constructs well-established topic hierarchies, accurately reflecting diverse parent-child topic relationships, explicit topic aspect sharing correlations for inter-topics, and differences between these shared topics. To validate the efficacy of our approach, we conduct experiments using a renowned public dataset and an online collection of Chinese financial documents. The experimental results confirm the superiority of our method over the state-of-the-art techniques in identifying multi-domain topic hierarchies, according to multiple evaluation metrics.
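
The abstract builds on the nested Chinese restaurant process (nCRP) as a prior over topic hierarchies. As background only, the following is a minimal sketch of how the standard nCRP assigns each document a root-to-leaf path; it is not the nCRP+ or rHDP proposed in the article, and it omits the CRF-based aspect sharing entirely. The names `Node`, `crp_choose`, `sample_path`, the concentration parameter `gamma`, and the fixed tree depth are all assumptions made for illustration.

```python
import random


class Node:
    """A topic node in the hierarchy (hypothetical helper, not from the article)."""

    def __init__(self):
        self.count = 0       # number of documents whose path passes through this node
        self.children = []   # child topics created so far


def crp_choose(counts, gamma, rng):
    """Chinese restaurant process draw.

    Returns index i of an existing table with probability counts[i] / (n + gamma),
    or len(counts) (meaning "open a new table") with probability gamma / (n + gamma),
    where n = sum(counts).
    """
    total = sum(counts) + gamma
    r = rng.random() * total
    acc = 0.0
    for i, c in enumerate(counts):
        acc += c
        if r < acc:
            return i
    return len(counts)


def sample_path(root, depth, gamma, rng):
    """Draw one document's path of `depth` nodes and update the visit counts."""
    root.count += 1
    path = [root]
    node = root
    for _ in range(depth - 1):
        k = crp_choose([child.count for child in node.children], gamma, rng)
        if k == len(node.children):
            # A new branch is opened: a subtopic not used by any earlier document.
            node.children.append(Node())
        node = node.children[k]
        node.count += 1
        path.append(node)
    return path


# Assumed settings for illustration: 20 documents, a 3-level tree, gamma = 1.0.
rng = random.Random(0)
root = Node()
paths = [sample_path(root, depth=3, gamma=1.0, rng=rng) for _ in range(20)]
```

In this sketch, popular branches attract more documents (the "rich get richer" property of the CRP), while `gamma` controls how readily new subtopics are created. Each document follows exactly one path here, which is the property the abstract's aspect-sharing mechanism is described as extending.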

Funder

National Natural Science Foundation of China

Natural Science and Foundation of Jiangxi Province

Funding Program for Academic and Technical Leaders in Major Disciplines of Jiangxi Province

Research Project for Science and Technology of Jiangxi Education Department

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Science Applications; General Business, Management and Accounting; Information Systems

