Integrating Text Classification into Topic Discovery Using Semantic Embedding Models
Published: 2023-08-31
Issue: 17
Volume: 13
Page: 9857
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences
Author:
Lezama-Sánchez Ana Laura (1), Tovar Vidal Mireya (1), Reyes-Ortiz José A. (2)
Affiliation:
1. Faculty of Computer Science, Benemerita Universidad Autonoma de Puebla, Puebla 72570, Mexico
2. Departamento de Sistemas, Universidad Autonoma Metropolitana, Mexico City 02200, Mexico
Abstract
Topic discovery identifies the main ideas within large volumes of textual data. It surfaces the recurring topics in a set of documents, providing an overview of the text. Current topic discovery models receive the text with or without pre-processing, which includes stop word removal, text cleaning, and normalization (lowercase conversion). A topic discovery process that receives general-domain text, with or without pre-processing, generates general topics. General topics do not offer a detailed overview of the input text, and manual text categorization is tedious and time-consuming. Extracting topics from automatically classified text is therefore necessary to generate specific topics enriched with top words that maintain semantic relationships among them. This paper presents an approach that integrates text classification into topic discovery for large amounts of English textual data, such as the 20-Newsgroups and Reuters corpora. Automatic text classification is applied before the topic discovery process to obtain specific topics for each class, with relevant semantic relationships between top words. Text classification analyzes the words that make up a document to decide which class or category it belongs to; the proposed integration then provides latent and specific topics, depicted by top words with high coherence, for each obtained class. Text classification is performed with a convolutional neural network (CNN) that incorporates an embedding model based on semantic relationships. Topic discovery over categorized text is carried out with the latent Dirichlet allocation (LDA), probabilistic latent semantic analysis (PLSA), and latent semantic analysis (LSA) algorithms. Topic discovery over categorized text was evaluated with the normalized topic coherence metric. The 20-Newsgroups corpus was classified, and twenty topics, each with its ten top words, were identified for each class. The normalized topic coherence obtained was 0.1723 with LDA, 0.1622 with LSA, and 0.1716 with PLSA. The Reuters corpus was also classified, and twenty and fifty topics were identified. With LDA and 20 topics for each class, the normalized topic coherence was 0.1441; with LSA it was 0.1360, and with PLSA it was 0.1436.
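The evaluation step described in the abstract can be illustrated with a minimal sketch. The abstract does not define the normalized topic coherence metric, so the code below assumes an NPMI-style score (normalized pointwise mutual information averaged over pairs of top words, with probabilities estimated from document co-occurrence); this may differ from the authors' formulation. The function names `group_by_class` and `npmi_coherence`, the toy documents, and the class labels are hypothetical. In the approach described above, the labels would come from the CNN classifier and the top words from LDA, LSA, or PLSA.

```python
import math
from collections import defaultdict
from itertools import combinations


def group_by_class(docs, labels):
    """Group tokenized documents by their (predicted) class label."""
    groups = defaultdict(list)
    for tokens, label in zip(docs, labels):
        groups[label].append(tokens)
    return dict(groups)


def npmi_coherence(top_words, docs):
    """Mean NPMI over all pairs of top words.

    Probabilities are document frequencies: p(w) is the fraction of
    documents containing w, p(w1, w2) the fraction containing both.
    This is an assumed definition, not taken from the paper.
    """
    n_docs = len(docs)
    doc_sets = [set(d) for d in docs]

    def prob(*words):
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: NPMI lower limit
        elif p12 == 1.0:
            scores.append(1.0)   # words co-occur in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy corpus; the labels stand in for classifier output (hypothetical).
docs = [
    ["space", "nasa", "orbit"],
    ["space", "orbit", "launch"],
    ["hockey", "team", "game"],
    ["hockey", "game", "season"],
]
labels = ["sci.space", "sci.space", "rec.sport.hockey", "rec.sport.hockey"]

by_class = group_by_class(docs, labels)

# A topic mixing words from both classes, scored on the whole corpus.
mixed = npmi_coherence(["space", "orbit", "hockey"], docs)

# A class-specific topic, scored only on that class's documents.
specific = npmi_coherence(["space", "orbit"], by_class["sci.space"])

print(f"mixed topic coherence: {mixed:.4f}")
print(f"class-specific topic coherence: {specific:.4f}")
```

Scores fall in the range [-1, 1]. On the toy data, the topic that mixes words from two classes scores lower than the class-specific one, which is the intuition behind classifying before discovering topics.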
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science
Cited by
1 article.
1. Clinical Text Analysis with Natural Language Processing: A BERT-based Approach;2024 International Conference on Communication, Computer Sciences and Engineering (IC3SE);2024-05-09