Arabic Document Classification: Performance Investigation of Preprocessing and Representation Techniques

Authors:

Muaad Abdullah Y. (1, 2), Davanagere Hanumanthappa Jayappa (1), Guru D.S. (1), Benifa J.V. Bibal (3), Chola Channabasava (3), AlSalman Hussain (4), Gumaei Abdu H. (5), Al-antari Mugahed A. (6)

Affiliation:

1. Department of Studies in Computer Science, University of Mysore, Manasagangothri, Mysore-570006, India

2. Sana’a Community College, Sana’a 5695, Yemen

3. Department of Computer Science and Engineering, Indian Institute of Information Technology, Kottayam, India

4. Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, 11543, Saudi Arabia

5. Computer Science Department, Faculty of Applied Sciences, Taiz University, Taiz 6803, Yemen

6. Department of Artificial Intelligence, Daeyang AI Center, Sejong University, Seoul 05006, Republic of Korea

Abstract

With the growing volume of online social posts, review comments, and digital documents, Arabic text classification (ATC) is in high demand for many natural language processing (NLP) applications, especially during the coronavirus pandemic. Because the same Arabic word can carry different meanings, this ambiguity can directly affect the performance of any AI-based framework. This work investigates how preprocessing and representation techniques affect the effectiveness of machine learning (ML) algorithms, measured across several classification techniques. The ATC process is influenced by several factors: stemming during preprocessing, the method of feature extraction and selection, the nature of the dataset, and the classification algorithm. To improve overall classification performance, preprocessing is used mainly to reduce each Arabic word to its root and thereby lower the dimensionality of the representation. Feature extraction and selection play a crucial role in representing Arabic text meaningfully and in improving classification accuracy. The selected classifiers are evaluated with various feature selection algorithms, and the results are compared across classifiers including multinomial Naive Bayes (MNB), Bernoulli Naive Bayes (BNB), Stochastic Gradient Descent (SGD), Support Vector Classifier (SVC), Logistic Regression (LR), and Linear SVC. All of these classifiers are evaluated on five balanced and unbalanced benchmark datasets: the BBC Arabic corpus, the CNN Arabic corpus, the Open-Source Arabic corpus (OSAc), ArCovidVac, and AlKhaleej. The evaluation results show that classification performance depends strongly on the preprocessing technique, the representation method, the classification technique, and the nature of the dataset used. On the considered benchmark datasets, Linear SVC outperformed the other classifiers overall when prominent features were selected.

Funder

King Saud University

Publisher

Hindawi Limited

Subject

General Engineering, General Mathematics


Cited by 11 articles.