Interpretable Topic Extraction and Word Embedding Learning Using Non-Negative Tensor DEDICOM
Published: 2021-01-19
Issue: 1
Volume: 3
Pages: 123-167
ISSN: 2504-4990
Container-title: Machine Learning and Knowledge Extraction
Short-container-title: MAKE
Language: en
Authors:
Hillebrand, Lars (ORCID),
Biesner, David (ORCID),
Bauckhage, Christian,
Sifa, Rafet
Abstract
Unsupervised topic extraction is a vital step in automatically distilling concise content information from large text corpora. Existing topic extraction methods lack the capability of modeling relations between topics, which would further aid text understanding. Therefore, we propose utilizing the Decomposition into Directional Components (DEDICOM) algorithm, which provides a uniquely interpretable matrix factorization for symmetric and asymmetric square matrices and tensors. We constrain DEDICOM to row-stochasticity and non-negativity in order to factorize pointwise mutual information matrices and tensors of text corpora. We identify latent topic clusters and their relations within the vocabulary and simultaneously learn interpretable word embeddings. Further, we introduce multiple methods based on alternating gradient descent to efficiently train constrained DEDICOM algorithms. We evaluate the qualitative topic modeling and word embedding performance of our proposed methods on several datasets, including a novel New York Times news dataset, and demonstrate how the DEDICOM algorithm provides deeper text analysis than competing matrix factorization approaches.
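The approach described in the abstract can be sketched in a few lines of numpy: build a positive PMI matrix from word co-occurrence counts, then factorize it as S ≈ A R Aᵀ with A non-negative and row-stochastic, trained by projected alternating gradient descent. This is a minimal illustration, not the authors' exact training procedure; the function names `ppmi` and `constrained_dedicom` and all hyperparameters are assumptions for this sketch.

```python
import numpy as np

def ppmi(counts, eps=1e-12):
    """Positive pointwise mutual information from a co-occurrence count matrix."""
    p_ij = counts / counts.sum()
    p_i = p_ij.sum(axis=1, keepdims=True)
    p_j = p_ij.sum(axis=0, keepdims=True)
    pmi = np.log((p_ij + eps) / (p_i * p_j + eps))
    return np.maximum(pmi, 0.0)

def constrained_dedicom(S, k, steps=1000, lr=1e-3, seed=0):
    """Sketch: factorize S ≈ A R A^T with A non-negative and row-stochastic,
    via projected alternating gradient descent on 0.5 * ||A R A^T - S||_F^2.
    (Illustrative only; the paper's training methods may differ.)"""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    A = rng.random((n, k))
    A /= A.sum(axis=1, keepdims=True)          # start row-stochastic
    R = rng.random((k, k))                     # asymmetric topic-relation matrix
    for _ in range(steps):
        E = A @ R @ A.T - S                    # residual
        grad_A = E @ A @ R.T + E.T @ A @ R     # gradient w.r.t. A
        A = np.maximum(A - lr * grad_A, 0.0)   # project onto non-negativity
        A /= A.sum(axis=1, keepdims=True) + 1e-12  # project onto row-stochasticity
        E = A @ R @ A.T - S
        grad_R = A.T @ E @ A                   # gradient w.r.t. R
        R = np.maximum(R - lr * grad_R, 0.0)   # keep relations non-negative
    return A, R
```

Under this scheme the rows of A are interpretable word embeddings (distributions over k latent topics), and R encodes directed relations between the topics.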
Funder
Bundesministerium für Bildung, Wissenschaft und Forschung
Subject
General Economics, Econometrics and Finance
Cited by
2 articles.