Social Media Topic Classification on Greek Reddit

Authors:

Mastrokostas Charalampos 1, Giarelis Nikolaos 1, Karacapilidis Nikos 1

Affiliation:

1. Industrial Management and Information Systems Lab, MEAD, University of Patras, 26504 Rio Patras, Greece

Abstract

Text classification (TC) is a subtask of natural language processing (NLP) that assigns pieces of text to predefined classes based on their content and thematic aspects. This process typically involves training a Machine Learning (ML) model on a labeled dataset, where each text example is associated with a specific class. Recent progress in Deep Learning (DL) has enabled the development of deep neural transformer models that surpass traditional ML ones. However, the topic classification literature prioritizes high-resource languages, particularly English, while research on low-resource languages, such as Greek, remains limited. Taking the above into consideration, this paper presents: (i) the first Greek social media topic classification dataset; (ii) a comparative assessment of a series of traditional ML models trained on this dataset, using an array of text vectorization methods including TF-IDF, classical word embeddings, and transformer-based Greek embeddings; (iii) a GREEK-BERT-based TC model fine-tuned on the same dataset; (iv) key empirical findings demonstrating that transformer-based embeddings significantly increase the performance of traditional ML models, while our fine-tuned DL model outperforms the previous ones. The dataset, the best-performing model, and the experimental code are made public, aiming to improve the reproducibility of this work and advance future research in the field.
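To make the TF-IDF part of the pipeline concrete, the following is a minimal sketch in plain Python, not the authors' code. It vectorizes a few toy posts with TF-IDF and classifies a new post by cosine similarity to per-class centroids. The nearest-centroid classifier is a simple stand-in for the traditional ML models the paper evaluates, and the example posts, the label names `sports` and `technology`, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; real Greek text would need proper preprocessing.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency computed over the training posts.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    # L2-normalized sparse TF-IDF vector; terms unseen in training are ignored.
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def fit_centroids(docs, labels, idf):
    # Average the TF-IDF vectors of each class (stand-in classifier, see lead-in).
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for t, v in tfidf(doc, idf).items():
            sums[label][t] += v
    return {
        label: {t: v / counts[label] for t, v in vec.items()}
        for label, vec in sums.items()
    }


def predict(text, idf, centroids):
    # Pick the class whose centroid has the largest dot product with the post vector.
    vec = tfidf(text, idf)

    def score(centroid):
        return sum(v * centroid.get(t, 0.0) for t, v in vec.items())

    return max(centroids, key=lambda label: score(centroids[label]))


# Hypothetical toy data; not taken from the paper's dataset.
train_docs = [
    "η ομάδα κέρδισε τον αγώνα",
    "ο προπονητής άλλαξε την ομάδα",
    "νέο κινητό με καλύτερη κάμερα",
    "ο επεξεργαστής του κινητού είναι γρήγορος",
]
train_labels = ["sports", "sports", "technology", "technology"]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)

print(predict("η ομάδα έχασε τον αγώνα", idf, centroids))
print(predict("γρήγορο κινητό με κάμερα", idf, centroids))
```

In practice one would swap the hand-rolled vectorizer and classifier for library implementations and, as the abstract reports, replace the TF-IDF vectors with transformer-based embeddings to improve accuracy.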

Publisher

MDPI AG

