The Impact of Arabic Diacritization on Word Embeddings

Authors:

Abbache Mohamed (1), Abbache Ahmed (2), Xu Jingwen (3), Meziane Farid (4), Wen Xianbin (1)

Affiliation:

1. School of Computer Science and Technology, Tianjin University of Technology, Tianjin, China

2. Mathematics and its Applications Laboratory, Faculty of Exact Sciences and Computing, Hassiba Ben Bouali University of Chlef, Ouled Fares, Chlef Province, Algeria

3. Computer Science, Faculty of Information Engineering, Computer Science and Statistics, Sapienza University of Rome, Rome, Italy

4. Data Science Research Centre, University of Derby, Derby, United Kingdom

Abstract

Word embeddings are used to represent words for text analysis. They play an essential role in many Natural Language Processing (NLP) studies and have contributed substantially to the extraordinary developments in the field over the last few years. In Arabic, diacritic marks are a vital feature for the readability and understandability of the language, yet current Arabic word embeddings are non-diacritized. In this article, we develop and compare word embedding models trained on diacritized and non-diacritized corpora to study the impact of Arabic diacritization on word embeddings. We evaluate the models in four ways: clustering of the nearest words, morphological semantic analysis, part-of-speech (POS) tagging, and semantic analysis. For a more thorough evaluation, we created three new datasets from scratch for the three downstream tasks, which we conducted with eight machine learning algorithms and two deep learning algorithms. Experimental results show that the diacritized model better captures syntactic and semantic relations and better clusters words of similar categories; overall, it outperforms the non-diacritized model. We also obtained further findings: the morphological semantic analysis shows that the advantage of the diacritized model grows as the number of target words increases, and diacritic marks have more significance in POS tagging than in the other tasks.
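The contrast the abstract studies, diacritized versus non-diacritized corpora, hinges on one preprocessing step: stripping the tashkeel marks to produce the second corpus before training its embedding model. Below is a minimal sketch of that step in Python, assuming the standard Unicode range for Arabic diacritics (U+064B–U+0652, plus the dagger alif U+0670 and the tatweel U+0640); the helper name and exact character set are illustrative, not taken from the paper.

```python
import re

# Arabic diacritic (tashkeel) marks: fathatan through sukun (U+064B-U+0652),
# plus dagger alif (U+0670); tatweel (U+0640) is also commonly stripped.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def strip_diacritics(text: str) -> str:
    """Return the non-diacritized form of an Arabic string."""
    return DIACRITICS.sub("", text)

sentence = "ذَهَبَ الوَلَدُ"  # "the boy went", fully diacritized
print(strip_diacritics(sentence))  # -> ذهب الولد
```

Training one embedding model (e.g. word2vec or fastText) on the original corpus and another on its stripped counterpart then yields the paired diacritized/non-diacritized models the article compares.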

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science


Cited by 2 articles.

1. Toward Robust Arabic AI-Generated Text Detection: Tackling Diacritics Challenges. Information, 2024-07-19.

2. AIRABIC: Arabic Dataset for Performance Evaluation of AI Detectors. 2023 International Conference on Machine Learning and Applications (ICMLA), 2023-12-15.
