A Mixed Malay–English Language COVID-19 Twitter Dataset: A Sentiment Analysis

Published: 2023-03-27 Issue: 2 Volume: 7 Page: 61
ISSN: 2504-2289
Container-title: Big Data and Cognitive Computing
Language: en
Short-container-title: BDCC

Author:

Kong Jeffery T. H.¹, Juwono Filbert H.², Ngu Ik Ying³, Nugraha I. Gde Dharma⁴, Maraden Yan⁴, Wong W. K.²

Affiliation:

1. Department of Electrical and Computer Engineering, Curtin University Malaysia, Miri 98009, Malaysia

2. Computer Science Program, University of Southampton Malaysia, Iskandar Puteri 79100, Malaysia

3. Department of Media and Communication, Curtin University Malaysia, Miri 98009, Malaysia

4. Department of Electrical Engineering, Universitas Indonesia, Depok 16424, Indonesia

Abstract

Social media has evolved into a platform for the dissemination of information, including fake news. False information about the Coronavirus Disease 2019 (COVID-19) pandemic is widespread, for example regarding vaccination. In this paper, we focus on sentiment analysis of Malaysian COVID-19-related news on social media such as Twitter. Tweets in Malaysia often combine Malay, English, and Chinese, with plenty of short forms, symbols, emojis, and emoticons packed within the maximum length of a tweet. The contributions of this paper are twofold. Firstly, we built a multilingual COVID-19 Twitter dataset comprising tweets written from 1 September 2021 to 12 December 2021. In particular, we collected 108,246 tweets, with over 67% in Malay, 27% in English, 2% in Chinese, and 4% in other languages. Secondly, we manually annotated 11,568 tweets with one of three sentiment classes (positive, negative, and neutral) to develop a Malay-language sentiment analysis tool. For this purpose, we applied Byte-Pair Encoding (BPE), originally a data compression method, to the texts and used two deep learning approaches: Multilingual Bidirectional Encoder Representations from Transformers (M-BERT) and a convolutional neural network (CNN). BPE tokenization encodes rare and unknown words as smaller, meaningful subwords. For the CNN, we converted the labeled tweets into image files. Our experiments explored different BPE vocabulary sizes with our BPE-Text-to-Image-CNN and BPE-M-BERT models. The results show that the optimal BPE vocabulary size is 12,000; larger values contribute little to the F1-score. Overall, BPE-M-BERT slightly outperforms the CNN model, indicating that the pre-trained M-BERT network has an advantage on our multilingual dataset.
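To illustrate how BPE breaks rare or unseen words into subwords, the following is a minimal Python sketch of the general BPE merge-learning procedure. It is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `encode`) and the toy word list are assumptions made for illustration only, and a real pipeline would more likely train on the full tweet corpus with an existing tokenizer library and a target vocabulary size such as the 12,000 reported above.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Apply the new merge rule to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment `word` into subwords by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (made up for illustration; not taken from the dataset).
corpus = ["vaksin", "vaksin", "vaksinasi", "vaccine"]
merges = learn_bpe(corpus, num_merges=50)
print(encode("vaksin", merges))     # a word seen in training
print(encode("vaksinnya", merges))  # an unseen word, split into known subwords
```

The number of merges plays the role of the vocabulary size: more merges produce longer subwords and fewer tokens per tweet, which is the trade-off the vocabulary-size experiments examine.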

Funder

Fundamental Research Grant Scheme

Universitas Indonesia’s International Indexed Publication (PUTI) Q1

Publisher

MDPI AG

Subject

Artificial Intelligence, Computer Science Applications, Information Systems, Management Information Systems

Link

https://www.mdpi.com/2504-2289/7/2/61/pdf

Cited by 2 articles.

1. Sentiment Analysis in Low-Resource Settings: A Comprehensive Review of Approaches, Languages, and Data Sources;IEEE Access;2024

2. A Review of Sentimental Analysis for Vaccine Dataset Using BI-LSTM Method;2023 IEEE International Conference on ICT in Business Industry & Government (ICTBIG);2023-12-08