Arabic Dialect Identification

Published:2014-03 Issue:1 Volume:40 Page:171-202
ISSN:0891-2017
Container-title:Computational Linguistics
language:en

Author:

Zaidan Omar F.¹, Callison-Burch Chris²

Affiliation:

1. Microsoft Research

2. University of Pennsylvania

Abstract

The written form of the Arabic language, Modern Standard Arabic (MSA), differs in a non-trivial manner from the various spoken regional dialects of Arabic, the true “native languages” of Arabic speakers. Those dialects, in turn, differ quite a bit from each other. However, due to MSA's prevalence in written form, almost all Arabic data sets have predominantly MSA content. In this article, we describe the creation of a novel Arabic resource with dialect annotations. We have created a large monolingual data set rich in dialectal Arabic content called the Arabic On-line Commentary Data set (Zaidan and Callison-Burch 2011). We describe our annotation effort to identify the dialect level (and dialect itself) in each of more than 100,000 sentences from the data set by crowdsourcing the annotation task, and delve into interesting annotator behaviors (like over-identification of one's own dialect). Using this new annotated data set, we consider the task of Arabic dialect identification: Given the word sequence forming an Arabic sentence, determine the variety of Arabic in which it is written. We use the data to train and evaluate automatic classifiers for dialect identification, and establish that classifiers using dialectal data significantly and dramatically outperform baselines that use MSA-only data, achieving near-human classification accuracy. Finally, we apply our classifiers to discover dialectal data from a large Web crawl consisting of 3.5 million pages mined from on-line Arabic newspapers.

Publisher

MIT Press - Journals

Subject

Artificial Intelligence, Computer Science Applications, Linguistics and Language, Language and Linguistics

Link

https://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00169

References: 46 articles.

1. Arabic Phonology

2. Arabic Sociolinguistics

Cited by 93 articles.

1. Arabic Automatic Speech Recognition: Challenges and Progress;Speech Communication;2024-09

2. ArabBert-LSTM: improving Arabic sentiment analysis based on transformer model and Long Short-Term Memory;Frontiers in Artificial Intelligence;2024-07-02

3. AfriDial: African Dialect Model based on Deep Learning for Sentiment Analysis;2024 International Wireless Communications and Mobile Computing (IWCMC);2024-05-27

4. Using Transformers to Classify Arabic Dialects on Social Networks;2024 6th International Conference on Pattern Analysis and Intelligent Systems (PAIS);2024-04-24

5. Advancements in Sentiment Analysis for the Algerian Dialect: A Comprehensive Review;2024 6th International Conference on Pattern Analysis and Intelligent Systems (PAIS);2024-04-24