Applying a Character-Level Model to a Short Arabic Dialect Sentence: A Saudi Dialect as a Case Study

Author:

Alqurashi Tahani

Abstract

Arabic dialect identification (ADI) has recently drawn considerable interest among researchers in language recognition and natural language processing. This study investigated the use of a character-level model, which is effectively unrestricted in its vocabulary, to identify fine-grained Arabic dialects in short written text. The study focused on the four main Saudi dialects spoken across the country. The proposed ADI approach consists of five main phases: dialect data collection, data preprocessing and labelling, character-based feature extraction, character-based modelling (deep learning and classical machine learning), and model performance evaluation. Several classical machine learning methods, including logistic regression, stochastic gradient descent, variations of the naive Bayes model, and support vector classification, were applied to the dataset. For deep learning, a character-level convolutional neural network (CNN) model was adapted with a bidirectional long short-term memory approach. The collected data were tested under various classification tasks, including two-, three- and four-way ADI tasks. The results revealed that the classical machine learning algorithms outperformed the CNN approach. Moreover, term frequency–inverse document frequency (TF-IDF) weighting combined with character n-grams ranging from unigrams to four-grams achieved the best performance among the tested parameters.
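The feature extraction step named in the abstract (TF-IDF over character n-grams from unigrams to four-grams) can be illustrated with a minimal sketch. This is not the author's implementation: the function names `char_ngrams`, `tfidf_vectors`, and `cosine`, the smoothed IDF formula, the L2 normalisation, and the toy Latin-script strings are all assumptions made for illustration. The abstract does not state the exact weighting variant or preprocessing used.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=4):
    """Return every character n-gram of length n_min..n_max, shortest first."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n_min=1, n_max=4):
    """Build L2-normalised TF-IDF vectors (as dicts) over character n-grams."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(gram for c in counts for gram in c)
    # Smoothed IDF; an assumed variant, not taken from the paper.
    idf = {g: math.log((1 + n_docs) / (1 + f)) + 1 for g, f in df.items()}
    vectors = []
    for c in counts:
        vec = {g: tf * idf[g] for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


# Toy strings standing in for short dialect sentences (hypothetical data).
docs = ["abab", "abcd"]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))
```

Because the features are character sequences rather than words, any unseen word still decomposes into known n-grams, which is the sense in which such a model is effectively unrestricted in vocabulary. Vectors of this kind could then be passed to a classifier such as logistic regression or a support vector classifier.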

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Cited by 4 articles.

1. Literature Review: NLP Techniques for Arabic Dialect Recognition;2024 International Conference on Circuit, Systems and Communication (ICCSC);2024-06-28

2. Enhancing Arabic Dialect Detection on Social Media: A Hybrid Model with an Attention Mechanism;Information;2024-05-28

3. Features and Methods;Synthesis Lectures on Human Language Technologies;2024

4. Special Issue “Recent Trends in Natural Language Processing and Its Applications”;Applied Sciences;2023-06-19
