Affiliation:
1. Department of Computer Science, Faculty of Sciences, University of Ferhat Abbas Setif 1, Setif, Algeria
Abstract
Background:
With the explosion of communication technologies and the accompanying
pervasive use of social media, posts, reviews, comments, and other forms
of expression in different languages have proliferated. This content has attracted researchers from different
fields: economics, political science, social science, psychology, and particularly language
processing. One prominent subject is the discrimination between similar languages and dialects
using natural language processing and machine learning techniques. The problem is usually
addressed by formulating identification as a classification task.
Methods:
The approach uses machine learning classification methods to discriminate between
Modern Standard Arabic (MSA) and four regional Arabic dialects: Egyptian, Levantine, Gulf,
and North African. Several models were trained to discriminate between these varieties on
large corpora mined from online Arabic newspapers and manually annotated.
Results:
Experimental results showed that n-gram features could substantially improve performance.
Logistic regression on character and word n-gram features represented as count vectors identified
the dialects with an overall accuracy of 95%. The best results were achieved with a linear
support vector classifier using TF-IDF vectors built from character unigrams, bigrams, and trigrams
and word unigrams and bigrams, with an overall accuracy of 95.1%.
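The best-performing configuration described above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the use of scikit-learn, the default hyperparameters, the label names, and the five toy sentences are all placeholders introduced here, and the toy sentences are not drawn from the paper's corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Toy placeholder data (hypothetical, NOT the paper's corpus): one short
# sentence per variety, with illustrative label names.
toy_data = [
    ("MSA", "ماذا تريد أن تفعل اليوم"),
    ("EGY", "عايز تعمل ايه النهارده"),
    ("LEV", "شو بدك تعمل اليوم"),
    ("GLF", "وش تبي تسوي اليوم"),
    ("NOR", "شنو بغيتي دير اليوم"),
]
train_labels = [label for label, _ in toy_data]
train_texts = [text for _, text in toy_data]

# Character 1- to 3-grams and word 1- to 2-grams, each TF-IDF weighted,
# concatenated into a single sparse feature matrix.
features = FeatureUnion([
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
])

# Linear support vector classifier with default settings (an assumption;
# the abstract does not report hyperparameters).
model = Pipeline([("features", features), ("clf", LinearSVC())])
model.fit(train_texts, train_labels)

predictions = model.predict(["شو بدك"])
print(predictions)
```

Under the same assumptions, swapping `TfidfVectorizer` for `CountVectorizer` and `LinearSVC` for `LogisticRegression` would give a sketch of the count-vector logistic regression variant.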
Conclusion:
The results showed that n-gram features could substantially improve performance. Additionally,
the choice of data representation could provide a significant performance
boost compared to a simple representation.
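To make the difference between representations concrete, the sketch below contrasts raw counts with TF-IDF weights for character n-grams using only the Python standard library. The helper names, the two toy strings, and the specific smoothed IDF formula, ln((1 + N) / (1 + df)) + 1, are assumptions chosen for illustration; the abstract does not state which TF-IDF variant was used.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max, in order."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def count_vector(text):
    """Raw count representation: n-gram -> number of occurrences."""
    return Counter(char_ngrams(text))


def tfidf_vectors(texts):
    """TF-IDF representation using a smoothed IDF (assumed variant)."""
    counts = [count_vector(t) for t in texts]
    n_docs = len(texts)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each n-gram counted once per document
    return [
        {
            gram: tf * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1.0)
            for gram, tf in c.items()
        }
        for c in counts
    ]


# Two hypothetical "documents"; "ab" occurs in both, "ba" only in the first.
docs = ["abab", "abcd"]
counts = [count_vector(d) for d in docs]
weighted = tfidf_vectors(docs)

print(counts[0]["ab"], weighted[0]["ab"])  # shared n-gram: weight equals count
print(counts[0]["ba"], weighted[0]["ba"])  # distinctive n-gram: weight boosted
```

With counts, every n-gram contributes in proportion to its frequency; with TF-IDF, n-grams that appear in fewer documents receive a larger weight, which is one plausible reason the representation can matter for separating closely related dialects.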
Publisher
Bentham Science Publishers Ltd.