Affiliation:
1. School of Business and Economics, Hochschule für Wirtschaft und Recht Berlin, Badensche Strasse 52, 10825 Berlin, Germany
Abstract
Topic analysis represents each document in a text corpus in a low-dimensional latent topic space. In some cases, the desired topic representation is subject to specific requirements or guidelines that constitute side information. For instance, sustainability-aware investors might want to automatically assess aspects of a firm's sustainability from the textual content of its corporate reports, focusing on the 17 established UN sustainability goals. Here, the main corpus consists of the corporate report texts, while the texts defining the 17 UN sustainability goals represent the side information. Under the assumption that both text corpora share a common low-dimensional subspace, we propose representing them in such a space via directed topic extraction using matrix co-factorization. Both the main and the side text corpora are first represented as term–context matrices, which are then jointly decomposed into word–topic and topic–context matrices. The word–topic matrix is common to both text corpora, whereas the topic–context matrices contain corpus-specific representations in the shared topic space. A nuisance parameter, which shifts the focus between minimizing the errors of the individual factorization terms, controls the extent to which the side information is taken into account. With our approach, documents from the main and the side corpora can be related to each other in the resulting latent topic space. That is, the corporate reports are represented in the same latent topic space as the descriptions of the 17 UN sustainability goals, enabling a structured, automatic sustainability assessment of the reports' textual content. We provide an algorithm for such directed topic extraction and propose techniques for visualizing and interpreting the results.
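To make the joint decomposition described above concrete, the following is a minimal sketch and not the algorithm of this paper. It assumes non-negative factors, a squared Frobenius loss, and standard multiplicative updates, none of which are specified in the abstract. The names `cofactorize`, `X_main`, `X_side`, `W`, `H_main`, `H_side`, and `alpha` are hypothetical, with `alpha` standing in for the parameter that weights the side-information term. The toy matrices are random placeholders, not real corpus data.

```python
import numpy as np


def joint_loss(X_main, X_side, W, H_main, H_side, alpha):
    # Assumed objective: ||X_main - W H_main||_F^2 + alpha * ||X_side - W H_side||_F^2
    err_main = np.linalg.norm(X_main - W @ H_main) ** 2
    err_side = np.linalg.norm(X_side - W @ H_side) ** 2
    return err_main + alpha * err_side


def cofactorize(X_main, X_side, n_topics, alpha=1.0, n_iter=200, eps=1e-9, seed=0):
    """Jointly factorize two term-context matrices with a shared word-topic matrix W.

    Illustrative only: non-negative multiplicative updates are an assumption.
    """
    if X_main.shape[0] != X_side.shape[0]:
        raise ValueError("both matrices must share the same vocabulary (rows)")
    rng = np.random.default_rng(seed)
    n_terms = X_main.shape[0]
    W = rng.random((n_terms, n_topics)) + eps
    H_main = rng.random((n_topics, X_main.shape[1])) + eps
    H_side = rng.random((n_topics, X_side.shape[1])) + eps
    losses = []
    for _ in range(n_iter):
        # Update the corpus-specific topic-context matrices with W held fixed.
        WtW = W.T @ W
        H_main *= (W.T @ X_main) / (WtW @ H_main + eps)
        H_side *= (W.T @ X_side) / (WtW @ H_side + eps)
        # Update the shared word-topic matrix; alpha weights the side corpus.
        numer = X_main @ H_main.T + alpha * (X_side @ H_side.T)
        denom = W @ (H_main @ H_main.T + alpha * (H_side @ H_side.T)) + eps
        W *= numer / denom
        losses.append(joint_loss(X_main, X_side, W, H_main, H_side, alpha))
    return W, H_main, H_side, losses


# Toy data: 40 terms, 25 main-corpus contexts, 17 side-corpus contexts.
rng = np.random.default_rng(42)
X_main = rng.random((40, 25))
X_side = rng.random((40, 17))

W, H_main, H_side, losses = cofactorize(X_main, X_side, n_topics=5, alpha=2.0)

# Relate main and side contexts by cosine similarity in the shared topic space.
H_main_unit = H_main / (np.linalg.norm(H_main, axis=0, keepdims=True) + 1e-12)
H_side_unit = H_side / (np.linalg.norm(H_side, axis=0, keepdims=True) + 1e-12)
similarity = H_main_unit.T @ H_side_unit  # shape: (25, 17)
```

In this sketch, a larger `alpha` pushes the shared matrix `W` toward reconstructing the side corpus well, while `alpha = 0` leaves `W` determined by the main corpus alone. Each row of `similarity` scores one main-corpus context against every side-corpus context.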