Data lake management using topic modeling techniques

Published: 2024-01-01 Volume: 3 Page: 282
ISSN: 2953-4917
Container-title: Data and Metadata
Short-container-title: Data and Metadata

Author:

Cherradi Mohamed, El Haddadi Anass

Abstract

With the rapid rise of information technology, the volume of unstructured data in data lakes is growing quickly, and analyzing, organizing, and automatically classifying it to derive meaningful information for a data-driven business has become a major challenge. Scientific documents consist of unlabeled text, which makes it difficult to link them properly to a topic model. Moreover, building a topic-level view of a heterogeneous dataset in the domain of big data lakes is a complex problem. Manual classification of text documents requires significant financial and human resources. Topic modeling techniques could streamline this process, improving our understanding of word meanings and potentially reducing the resource burden. This paper presents a comparative study of metadata-based classification of a scientific documents dataset, applying two well-known machine learning-based topic modeling approaches: Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). To assess the effectiveness of our proposals, we conducted a thorough examination centred on key assessment metrics, including coherence scores, perplexity, and log-likelihood. The evaluation was carried out on a corpus of scientific publications, using information from the title, abstract, keywords, authors, affiliation, and other metadata fields. The results show that LDA outperforms LSA, with a coherence value of 0.884 compared with 0.768 for LSA.
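As a rough illustration of two of the metrics named in the abstract, the sketch below computes a topic coherence score and a perplexity value using only the Python standard library. The abstract does not say which coherence measure the authors used, so the UMass-style formula here is an assumption, and its values are not on the same scale as the reported 0.884 and 0.768. The toy corpus, the word lists, and the function names `umass_coherence` and `perplexity` are all hypothetical and do not come from the paper.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic (assumed measure, not the paper's).

    top_words: the topic's top words, most probable first.
    documents: list of tokenized documents (lists of words).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_freq(top_words[j])
            if denom == 0:
                continue  # word never occurs in the corpus; skip the pair
            co = doc_freq(top_words[i], top_words[j])
            # +1 smoothing avoids log(0) when the pair never co-occurs.
            score += math.log((co + 1) / denom)
    return score


def perplexity(total_log_likelihood, n_tokens):
    """Perplexity from a corpus log-likelihood (natural log) and token count."""
    return math.exp(-total_log_likelihood / n_tokens)


# Hypothetical toy corpus of tokenized metadata records.
docs = [
    ["data", "lake", "metadata", "storage"],
    ["topic", "model", "lda", "corpus"],
    ["data", "lake", "storage", "schema"],
    ["topic", "model", "lsa", "corpus"],
]

coherent = ["data", "lake", "storage"]  # words that co-occur in documents
mixed = ["data", "topic", "schema"]     # words drawn from unrelated documents

print("coherent topic:", umass_coherence(coherent, docs))
print("mixed topic:   ", umass_coherence(mixed, docs))
print("perplexity:    ", perplexity(-10 * math.log(4), 10))
```

On this toy corpus the word list whose terms appear together in the same documents scores higher than the mixed list, which is the intuition behind using coherence to compare topic models. Lower perplexity (equivalently, higher log-likelihood per token) indicates a better fit to the data.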

Publisher

Salud, Ciencia y Tecnologia

References (36 articles)

1. Boyd-Graber, J., Hu, Y., & Mimno, D. (2017). Applications of Topic Models. Foundations and Trends® in Information Retrieval, 11, 143‑296. https://doi.org/10.1561/1500000030

2. Boussaadi, S., Aliane, D. H., & Abdeldjalil, P. O. (2020). The Researchers Profile with Topic Modeling. IEEE International Conference on Electronics, Control, Optimization and Computer Science (ICECOCS), 1‑6. https://doi.org/10.1109/ICECOCS50124.2020.9314588

3. Kherwa, P., & Bansal, P. (2018). Topic Modeling: A Comprehensive Review. ICST Transactions on Scalable Information Systems, 7, 159623. https://doi.org/10.4108/eai.13-7-2018.159623

4. Anupriya, P., & Karpagavalli, S. (2015). LDA based topic modeling of journal abstracts. 2015 International Conference on Advanced Computing and Communication Systems, 1‑5. https://doi.org/10.1109/ICACCS.2015.7324058

5. Newman, D., Noh, Y., Talley, E., Karimi, S., & Baldwin, T. (2010). Evaluating topic models for digital libraries. Proceedings of the 10th Annual Joint Conference on Digital Libraries, 215‑224. https://doi.org/10.1145/1816123.1816156