Arabic Diacritic Recovery Using a Feature-rich biLSTM Model-Reference-Cited by-同舟云学术

Arabic Diacritic Recovery Using a Feature-rich biLSTM Model

Published:2021-04-08 Issue:2 Volume:20 Page:1-18
ISSN:2375-4699
Container-title:ACM Transactions on Asian and Low-Resource Language Information Processing
language:en
Short-container-title:ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Darwish Kareem¹,Abdelali Ahmed¹^ORCID,Mubarak Hamdy¹,Eldesouki Mohamed¹

Affiliation:

1. Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar

Abstract

Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: The first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of word stems and generally specify their syntactic roles. Recovering CEs is relatively harder than recovering core-word diacritics due to inter-word dependencies, which are often distant. In this article, we use feature-rich recurrent neural network model that use a variety of linguistic and surface-level features to recover both core word diacritics and case endings. Our model surpasses all previous state-of-the-art systems with a CW error rate (CWER) of 2.9% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA) and CWER of 2.2% and CEER of 2.5% for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant word error rates are 6.0% and 4.3% for MSA and CA, respectively. This highlights the effectiveness of feature engineering for such deep neural models.

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3434235

Reference46 articles.

1. Martín Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Ian Goodfellow Andrew Harp Geoffrey Irving Michael Isard Yangqing Jia Rafal Jozefowicz Lukasz Kaiser Manjunath Kudlur Josh Levenberg Dandelion Mané Rajat Monga Sherry Moore Derek Murray Chris Olah Mike Schuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul Tucker Vincent Vanhoucke Vijay Vasudevan Fernanda Viégas Oriol Vinyals Pete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Retrieved from https://www.tensorflow.org/. Martín Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Ian Goodfellow Andrew Harp Geoffrey Irving Michael Isard Yangqing Jia Rafal Jozefowicz Lukasz Kaiser Manjunath Kudlur Josh Levenberg Dandelion Mané Rajat Monga Sherry Moore Derek Murray Chris Olah Mike Schuster Jonathon Shlens Benoit Steiner Ilya Sutskever Kunal Talwar Paul Tucker Vincent Vanhoucke Vijay Vasudevan Fernanda Viégas Oriol Vinyals Pete Warden Martin Wattenberg Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Retrieved from https://www.tensorflow.org/.

2. Automatic diacritization of Arabic text using recurrent neural networks;Abandah Gheith A.;Int. J. Doc. Anal. Recogn.,2015

Cited by 7 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. BERT-Based Arabic Diacritization: A state-of-the-art approach for improving text accuracy and pronunciation;Expert Systems with Applications;2024-08

2. Unlocking the Power of Transfer Learning with Ad-Dabit-Al-Lughawi: A Token Classification Approach for Enhanced Arabic Text Diacritization;2024

3. Arabic Diacritics Restoration Using Maximum Entropy Language Models;IEEE Signal Processing Letters;2023

4. Arabic Syntactic Diacritics Restoration Using BERT Models;Computational Intelligence and Neuroscience;2022-10-30

5. Simple Extensible Deep Learning Model for Automatic Arabic Diacritization;ACM Transactions on Asian and Low-Resource Language Information Processing;2022-03-31