Classification Performance Comparison of BERT and IndoBERT on SelfReport of COVID-19 Status on Social Media

Published: 2024-03-20 Volume: 30 Page: 61-67
ISSN: 2544-0764
Container-title: Journal of Computer Sciences Institute
Short-container-title: J. Comput. Sci. Inst.

Author:

Budiman Irwan, Faisal Mohammad Reza, Faridhah Astina, Farmadi Andi, Mazdadi Muhammad Itqan, Saragih Triando Hamonangan, Abadi Friska

Abstract

Messages shared on social media platforms such as X are automatically categorized into two groups: those that self-report COVID-19 status and those that do not. However, it is essential to note that these messages cannot serve as a reliable monitoring tool for tracking the spread of the COVID-19 pandemic. Social media messages can be classified by applying classification algorithms. Many deep learning-based algorithms, such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM), have been used for text classification. However, CNN has limitations in understanding global context, while LSTM focuses more on understanding word-by-word sequences; in addition, both require a lot of data to learn. Bidirectional Encoder Representations from Transformers (BERT) is a more recent text classification algorithm developed to cover the shortcomings of these earlier algorithms, and many BERT variants now exist. The primary objective of this study was to compare the effectiveness of two classification models, BERT and IndoBERT, in identifying messages that self-report COVID-19 status. Both models were evaluated using raw and preprocessed text data from X. The study's findings revealed that the IndoBERT model performed better, achieving an accuracy of 94%, whereas the BERT model achieved 82%.
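
The raw-versus-preprocessed comparison and the accuracy figures mentioned in the abstract can be illustrated with a minimal Python sketch. The cleaning steps below (lowercasing, removing URLs, mentions, hashtags and punctuation) are assumptions chosen for illustration; the abstract does not state which preprocessing the authors applied. The function names `preprocess` and `accuracy` are likewise hypothetical, and no BERT or IndoBERT model is loaded here.

```python
import re


def preprocess(text: str) -> str:
    """Illustrative tweet cleaning (assumed steps, not taken from the paper)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)       # drop mentions and hashtags
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and emoji
    return " ".join(text.split())              # collapse repeated whitespace


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical message and labels (1 = self-report, 0 = not a self-report).
raw = "Saya POSITIF covid :( https://t.co/abc @dinkes #covid19"
cleaned = preprocess(raw)
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

In a comparison of this kind, the same classifier would be trained and scored once on the raw messages and once on the cleaned ones, so that any difference in `accuracy` can be attributed to the preprocessing step.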

Publisher

Politechnika Lubelska

References: 18 articles.

1. T. Mackey, V. Purushothaman, J. Li, N. Shah, M. Nali, C. Bardier, B. Liang, M. Cai, R. Cuomo, Machine learning to detect self-reporting of symptoms, testing access, and recovery associated with COVID-19 on Twitter: retrospective big data infoveillance study, JMIR public health and surveillance, 6(2) (2020) 1-9, https://doi.org/10.2196/19509

2. A. Z. Klein, A. Magge, K. O’Connor, J. I. Flores Amaro, D. Weissenbacher, and G. Gonzalez Hernandez, Toward using Twitter for tracking COVID-19: a natural language processing pipeline and exploratory data set, Journal of medical Internet research, 23 (1) (2021) 1-6, https://doi.org/10.2196/25314

3. F. E. Ayo, O. Folorunso, F. T. Ibharalu, and I. A. Osinuga, Machine learning techniques for hate speech classification of Twitter data: State-of-The-Art, future challenges and research directions, Computer Science Review, 38 (2020) 1-34, https://doi.org/10.1016/j.cosrev.2020.100311

4. M. A. Riza, N. Charibaldi, U. Pembangunan, and N. Veteran, Emotion Detection in Twitter Social Media Using Long Short-Term Memory (LSTM) and Fast Text, 3(1) (2021) 15–26, https://doi.org/10.25139/ijair.v3i1.3827

5. A. Chiorrini, C. Diamantini, A. Mircoli, and D. Potena, Emotion and sentiment analysis of posts using BERT, In EDBT/ICDT Workshops, 3 (2021) 1-7