Validation of scientific topic models using graph analysis and corpus metadata
Published: 2022-03-30
Volume: 127, Issue: 9
Pages: 5441-5458
ISSN: 0138-9130
Container-title: Scientometrics
Short-container-title: Scientometrics
Language: en
Author:
Vázquez, Manuel A.; Pereira-Delgado, Jorge; Cid-Sueiro, Jesús; Arenas-García, Jerónimo
Abstract
Probabilistic topic modeling algorithms like Latent Dirichlet Allocation (LDA) have become powerful tools for the analysis of large collections of documents (such as papers, projects, or funding applications) in science, technology and innovation (STI) policy design and monitoring. However, selecting an appropriate and stable topic model for a specific application (by adjusting the hyperparameters of the algorithm) is not a trivial problem. Common validation metrics like coherence or perplexity, which focus on the quality of topics, are not a good fit in applications where the quality of the document similarity relations inferred from the topic model is especially relevant. Relying on graph analysis techniques, the aim of our work is to establish a new methodology for the selection of hyperparameters which is specifically oriented to optimize the similarity metrics emanating from the topic model. To this end, we propose two graph metrics: the first measures the variability of the similarity graphs that result from different runs of the algorithm for a fixed value of the hyperparameters, while the second measures the alignment between the graph derived from the LDA model and another obtained using metadata available for the corresponding corpus. Through experiments on various corpora related to STI, it is shown that the proposed metrics provide relevant indicators to select the number of topics and build persistent topic models that are consistent with the metadata. Their use, which can be extended to other topic models beyond LDA, could facilitate the systematic adoption of these techniques in STI policy analysis and design.
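The first proposed metric compares the document similarity graphs produced by independent runs of the topic model. A minimal sketch of that idea, using an illustrative Bhattacharyya-coefficient similarity, a top-k neighbour graph, and edge-set Jaccard overlap (the paper's exact similarity measure and graph metric are not specified in the abstract, so these are assumptions):

```python
import numpy as np

def topk_similarity_graph(doc_topics, k=5):
    """Build a top-k similarity graph from a document-topic matrix.

    Similarity between two documents is the Bhattacharyya coefficient
    between their topic distributions (an illustrative choice).
    Returns an undirected edge set of pairs (i, j) with i < j.
    """
    sqrt_t = np.sqrt(doc_topics)
    sim = sqrt_t @ sqrt_t.T           # pairwise Bhattacharyya coefficients
    np.fill_diagonal(sim, -np.inf)    # exclude self-similarity
    edges = set()
    for i, row in enumerate(sim):
        for j in np.argsort(row)[-k:]:  # k most similar neighbours of doc i
            edges.add((min(i, j), max(i, j)))
    return edges

def edge_jaccard(edges_a, edges_b):
    """Jaccard overlap of two edge sets: 1.0 means identical graphs."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0

# Two hypothetical doc-topic matrices from independent LDA runs
# over the same corpus (random here, purely for illustration):
rng = np.random.default_rng(0)
run_a = rng.dirichlet(np.ones(10), size=50)
run_b = rng.dirichlet(np.ones(10), size=50)
stability = edge_jaccard(topk_similarity_graph(run_a),
                         topk_similarity_graph(run_b))
# A low overlap across runs signals an unstable hyperparameter setting.
```

The second metric in the paper follows the same pattern, except that one of the two graphs is built from corpus metadata (e.g., shared funding programme or subject categories) instead of a second model run.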
Funder
Horizon 2020 Framework Programme; Ministerio de Ciencia, Innovación y Universidades; Universidad Carlos III
Publisher
Springer Science and Business Media LLC
Subject
Library and Information Sciences; Computer Science Applications; General Social Sciences
Cited by: 6 articles.