Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

Published:2021-07-02 Issue:1 Volume:8
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Albalawi Yahya, Buckley Jim, Nikolov Nikola S.

Abstract

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processings applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another used only for testing the generality of our models. Our results point to the conclusion that only four out of the 26 pre-processings improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second data set.
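
The abstract reports both accuracy and F1 score, and the two metrics do not always rank models the same way. As a minimal sketch of how that can happen, the Python snippet below computes both metrics from confusion-matrix counts. The function name accuracy_and_f1 and all counts are hypothetical and chosen only for illustration; they are not taken from the paper or its data sets.

```python
def accuracy_and_f1(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (accuracy, F1 of the positive class) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical test set (assumption): 100 health-related tweets, 900 others.
# Model A is conservative: few false positives, but it misses many positives.
acc_a, f1_a = accuracy_and_f1(tp=40, fp=5, fn=60, tn=895)
# Model B is liberal: it finds most positives, but flags more negatives.
acc_b, f1_b = accuracy_and_f1(tp=90, fp=100, fn=10, tn=800)

print(f"Model A: accuracy={acc_a:.3f}, F1={f1_a:.3f}")
print(f"Model B: accuracy={acc_b:.3f}, F1={f1_b:.3f}")
```

With these made-up counts, Model A has the higher accuracy while Model B has the higher F1 score, which is why it can help to read both metrics together when the classes are unevenly sized.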

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

https://link.springer.com/content/pdf/10.1186/s40537-021-00488-w.pdf

References: 127 articles.

1. Kanan T, Sadaqa O, Aldajeh A, Alshwabka H, Dolime WA, AlZu’bi S et al., editors. A review of natural language processing and machine learning tools used to analyze Arabic social media. In: 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT); 9–11 April 2019.

2. Al-Ayyoub M, Nuseir A, Alsmearat K, Jararweh Y, Gupta B. Deep learning for Arabic NLP: a survey. J Comput Sci. 2018;26:522–31. https://doi.org/10.1016/j.jocs.2017.11.011.

3. Abo MEM, Raj RG, Qazi A. A review on Arabic sentiment analysis: state-of-the-art, taxonomy and open research challenges. IEEE Access. 2019;7:162008–24. https://doi.org/10.1109/ACCESS.2019.2951530.

4. Alrifai K, Rebdawi G, Ghneim N. Arabic tweeps gender and dialect prediction: notebook for PAN at CLEF 2017. CEUR Workshop Proceedings; 2017. p. 1–9.

5. HaCohen-Kerner Y, Yigal Y, Shayovitz E, Miller D, Breckon T, editors. Author profiling: Gender prediction from tweets and images: notebook for PAN at CLEF 2018. CEUR Workshop Proceedings; 2018.

Cited by 27 articles.

1. How the crisis of trust in experts occurs on social media in China? Multiple-case analysis based on data mining;Humanities and Social Sciences Communications;2024-08-27

2. TrG2P: A transfer-learning-based tool integrating multi-trait data for accurate prediction of crop yield;Plant Communications;2024-07

3. Analyzing recent trends in deep-learning approaches: a review on urban environmental hazards and disaster studies for monitoring, management, and mitigation toward sustainability;International Journal on Smart Sensing and Intelligent Systems;2024-04-01

4. Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers;Information Systems;2024-03

5. Sentiment Analysis of Turkish Drug Reviews with Bidirectional Encoder Representations from Transformers;ACM Transactions on Asian and Low-Resource Language Information Processing;2024-01-15