Diacritic-Based Matching of Arabic Words

Published:2019-06-30 Issue:2 Volume:18 Page:1-21
ISSN:2375-4699
Container-title:ACM Transactions on Asian and Low-Resource Language Information Processing
language:en
Short-container-title:ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Jarrar Mustafa¹, Zaraket Fadi², Asia Rami¹, Amayreh Hamzeh¹

Affiliation:

1. Birzeit University, West Bank, Palestine

2. American University of Beirut, Lebanon

Abstract

Words in Arabic consist of letters and short vowel symbols, called diacritics, inscribed atop regular letters. Changing diacritics may change the syntax and semantics of a word, turning it into another word. This makes it difficult to compare words based solely on string matching. Typically, Arabic NLP applications resort to morphological analysis to combat the ambiguity originating from this and other challenges. In this article, we introduce three alternative algorithms to compare two words with possibly different diacritics: the knowledge-based Subsume algorithm, the rule-based Imply algorithm, and the machine-learning-based Alike algorithm. We evaluated the soundness, completeness, and accuracy of the algorithms against a large dataset of 86,886 word pairs. Our evaluation shows an accuracy of 100% for Subsume, 99.32% for Imply, and 99.53% for Alike. Although accurate, Subsume was able to judge only 75% of the data. Both Subsume and Imply are sound, while Alike is not. We demonstrate the utility of the algorithms in real-life use cases: lemma disambiguation and linking hundreds of Arabic dictionaries.
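To illustrate why plain string matching falls short, the sketch below compares the undiacritized word for k-t-b with two diacritized forms, kataba ("he wrote") and kutub ("books"). It is not the Subsume, Imply, or Alike algorithm from the article; it is a minimal position-wise check written for illustration, and the function names `split_letters` and `compatible` are hypothetical. It assumes two words are compatible when their base letters are identical and no letter carries conflicting diacritics.

```python
import unicodedata


def split_letters(word):
    """Split a word into (base_letter, diacritics) pairs.

    Assumption: every diacritic is a Unicode combining mark that
    follows the letter it belongs to.
    """
    units = []
    for ch in unicodedata.normalize("NFC", word):
        if unicodedata.combining(ch) and units:
            base, marks = units[-1]
            units[-1] = (base, marks + ch)
        else:
            units.append((ch, ""))
    return units


def compatible(word1, word2):
    """Return True if the words share base letters and no diacritics conflict.

    A letter without diacritics is treated as compatible with any
    diacritized version of the same letter (an illustrative assumption).
    """
    units1, units2 = split_letters(word1), split_letters(word2)
    if [base for base, _ in units1] != [base for base, _ in units2]:
        return False
    return all(
        not marks1 or not marks2 or set(marks1) == set(marks2)
        for (_, marks1), (_, marks2) in zip(units1, units2)
    )


# kaf, teh, beh with fatha on each letter: kataba, "he wrote"
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
# kaf, teh with damma, final beh unmarked: kutub, "books"
kutub = "\u0643\u064F\u062A\u064F\u0628"
# the same three letters without any diacritics
bare = "\u0643\u062A\u0628"

print(kataba == bare)             # exact string matching rejects the pair
print(compatible(kataba, bare))   # the bare form matches either reading
print(compatible(kutub, bare))
print(compatible(kataba, kutub))  # fatha and damma conflict
```

The undiacritized form is compatible with both readings, which is the ambiguity the abstract describes, while the two fully diacritized forms conflict on the first letter. A realistic matcher would also need linguistic rules or lexicon knowledge, which this sketch leaves out.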

Funder

Lebanese National Council for Scientific Research

Birzeit University

VerbMesh project, funded by BZU research committee

Google's Faculty Research Award

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3242177
