AraCust: a Saudi Telecom Tweets corpus for sentiment analysis

Published: 2021-05-20 Volume: 7 Page: e510
ISSN: 2376-5992
Container-title: PeerJ Computer Science
Language: en

Author:

Almuqren Latifah¹,², Cristea Alexandra¹

Affiliation:

1. Department of Computer Science, Durham University, Durham, United Kingdom

2. Information Science Department, Computer and Information Sciences College, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia

Abstract

Compared with other languages, Arabic lacks large corpora for Natural Language Processing (Assiri, Emam & Al-Dossari, 2018; Gamal et al., 2019). A number of scholars have depended on translation from one language to another to construct their corpora (Rushdi-Saleh et al., 2011). This paper presents how we constructed, cleaned, pre-processed, and annotated AraCust, our 20,000-tweet Gold Standard Corpus (GSC), the first Telecom GSC for Arabic Sentiment Analysis (ASA) for Dialectal Arabic (DA). AraCust contains Saudi dialect tweets, processed from a self-collected Arabic tweets dataset, and has been annotated for sentiment analysis, i.e., manually labelled (k = 0.60). In addition, we illustrate AraCust’s power by performing an exploratory data analysis of the features that arise from the nature of our corpus, to assist with choosing the right ASA methods for it. To evaluate our Gold Standard Corpus AraCust, we first ran a simple experiment, using a supervised classifier, to offer benchmark results for future work. We also applied the same supervised classifier to a publicly available Arabic dataset created from Twitter, ASTD (Nabil, Aly & Atiya, 2015). The results show that the classifier performs better on AraCust than on ASTD, reaching 91% accuracy and an 89% average F1 score (F1avg). The AraCust corpus will be released, together with code useful for its exploration, via GitHub as a part of this submission.

Funder

Deanship of Scientific Research at Princess Nourah bint Abdulrahman University, Saudi Arabia

Publisher

PeerJ

Subject

General Computer Science

Link

https://peerj.com/articles/cs-510.pdf

Reference: 82 articles.

1. Sentiment analysis in multiple languages: feature selection for opinion classification in web forums;Abbasi;ACM Transactions on Information Systems,2008

2. Toward building a large-scale Arabic sentiment lexicon;Abdul-Mageed,2012

3. SAMAR: subjectivity and sentiment analysis for Arabic social media;Abdul-Mageed;Computer Speech & Language,2014

4. Towards improving the lexicon-based approach for Arabic sentiment analysis;Abdulla;International Journal of Information Technology and Web Engineering,2014

5. Effect of Saudi dialect preprocessing on Arabic sentiment analysis;Al-Harbi;International Journal of Advanced Computer Technology,2015

Cited by 17 articles.

1. SOD: A Corpus for Saudi Offensive Language Detection Classification;Computers;2024-08-20

2. A Deep Learning-based Classification Model for Arabic News Tweets Using Bidirectional Long Short-Term Memory Networks;Pertanika Journal of Science and Technology;2024-07-16

3. A hybrid neural network model based on transfer learning for Arabic sentiment analysis of customer satisfaction;Engineering Reports;2024-03-03

4. DEPREM ZAMANINDAKİ GSM OPERATÖRLERİNE İLİŞKİN TÜKETİCİ ALGILARININ SOSYAL MEDYA PAYLAŞIMLARINDA ARAŞTIRILMASI;Akademik Yaklaşımlar Dergisi;2024-02-22

5. Toward Early Detection of Depression: Detecting Depression Symptoms in Arabic Tweets Using Pretrained Transformers;IEEE Access;2024