Social Media Topic Classification on Greek Reddit
Published: 2024-08-26
Issue: 9
Volume: 15
Page: 521
ISSN: 2078-2489
Container-title: Information
Language: en
Short-container-title: Information
Author:
Mastrokostas Charalampos (1), Giarelis Nikolaos (1), Karacapilidis Nikos (1)
Affiliation:
1. Industrial Management and Information Systems Lab, MEAD, University of Patras, 26504 Rio Patras, Greece
Abstract
Text classification (TC) is a subtask of natural language processing (NLP) that assigns pieces of text to predefined classes based on their textual content and thematic aspects. This process typically involves training a Machine Learning (ML) model on a labeled dataset, where each text example is associated with a specific class. Recent progress in Deep Learning (DL) has enabled the development of deep neural transformer models, which surpass traditional ML ones. However, the topic classification literature prioritizes high-resource languages, particularly English, while research efforts for low-resource ones, such as Greek, are limited. Taking the above into consideration, this paper presents: (i) the first Greek social media topic classification dataset; (ii) a comparative assessment of a series of traditional ML models trained on this dataset, utilizing an array of text vectorization methods including TF-IDF, classical word embeddings, and transformer-based Greek embeddings; (iii) a GREEK-BERT-based TC model fine-tuned on the same dataset; (iv) key empirical findings demonstrating that transformer-based embeddings significantly increase the performance of traditional ML models, while our fine-tuned DL model outperforms the previous ones. The dataset, the best-performing model, and the experimental code are made public, aiming to improve the reproducibility of this work and advance future research in the field.
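As an illustration of the TF-IDF vectorization route mentioned in the abstract, the following is a minimal sketch using only the Python standard library. It is not the authors' pipeline: the nearest-centroid classifier stands in for the unspecified traditional ML models, and the toy English texts, the labels, and all function names are hypothetical, chosen only to show the mechanics.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real Greek text needs more care).
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency learned from the training texts.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def normalize(vec):
    # Scale a sparse vector to unit L2 length (empty vectors stay empty).
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def tfidf(text, idf):
    # Term frequency times IDF; terms unseen during fitting are ignored.
    counts = Counter(t for t in tokenize(text) if t in idf)
    return normalize({t: c * idf[t] for t, c in counts.items()})


def fit_centroids(docs, labels, idf):
    # One unit-length centroid per class: the normalized sum of its TF-IDF vectors.
    sums = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for term, weight in tfidf(doc, idf).items():
            sums[label][term] += weight
    return {label: normalize(vec) for label, vec in sums.items()}


def predict(text, idf, centroids):
    # Both vectors are unit length, so the dot product equals cosine similarity.
    vec = tfidf(text, idf)
    return max(
        centroids,
        key=lambda label: sum(w * centroids[label].get(t, 0.0) for t, w in vec.items()),
    )


# Hypothetical toy data, not drawn from the paper's dataset.
train_docs = [
    "the team won the football match",
    "great goal in the second half of the match",
    "new phone with a faster processor",
    "the laptop processor and memory upgrade",
]
train_labels = ["sports", "sports", "technology", "technology"]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)

print(predict("who scored the goal in the match", idf, centroids))     # sports
print(predict("is this processor good for a laptop", idf, centroids))  # technology
```

Replacing `tfidf` with a function that returns dense word or transformer embeddings would leave the classifier side of such a sketch unchanged, which is the kind of comparison the abstract describes.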