Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study

Author:

Tarmom Taghreed, Teahan William, Atwell Eric, Alsalka Mohammad Ammar

Abstract

The occurrence of code-switching in online communication, when a writer switches among multiple languages, presents a challenge for natural language processing tools, since they are designed for texts written in a single language. To answer the challenge, this paper presents detailed research on ways to detect code-switching in Arabic text automatically. We compare the prediction by partial matching (PPM) compression-based classifier, implemented in Tawa, and a traditional machine learning classifier, sequential minimal optimization (SMO), implemented in the Waikato Environment for Knowledge Analysis, working specifically on Arabic text taken from Facebook. Three experiments were conducted in order to: (1) detect code-switching between the Egyptian dialect and English; (2) detect code-switching among the Egyptian dialect, the Saudi dialect, and English; and (3) detect code-switching among the Egyptian dialect, the Saudi dialect, Modern Standard Arabic (MSA), and English. Our experiments showed that PPM achieved a higher accuracy rate than SMO, with 99.8% versus 97.5% in the first experiment and 97.8% versus 80.7% in the second. In the third experiment, PPM achieved a lower accuracy rate than SMO, with 53.2% versus 60.2%. Code-switching between Egyptian Arabic and English text is easiest to detect because Arabic and English are generally written in different character sets. It is more difficult to distinguish between Arabic dialects and MSA, as these use the same character set, and most users of Arabic, especially Saudis and Egyptians, frequently mix MSA with their dialects. We also note that the MSA corpus used for training the MSA model may not represent MSA Facebook text well, being built from news websites. This paper also describes in detail the new Arabic corpora created for this research and our experiments.
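
The observation that Arabic and English are generally written in different character sets can be illustrated with a minimal Python sketch. This is not the PPM or SMO classifier evaluated in the paper; it is a hypothetical script-based baseline, and the names `tag_token` and `switch_points` and the chosen Unicode ranges are assumptions made for illustration. It tags each whitespace-separated token by script and reports where the script changes.

```python
import re

# Assumed ranges: the Arabic and Arabic Supplement Unicode blocks, and basic Latin letters.
ARABIC = re.compile(r"[\u0600-\u06FF\u0750-\u077F]")
LATIN = re.compile(r"[A-Za-z]")


def tag_token(token):
    """Label a token by the script(s) of the letters it contains."""
    has_arabic = bool(ARABIC.search(token))
    has_latin = bool(LATIN.search(token))
    if has_arabic and has_latin:
        return "mixed"
    if has_arabic:
        return "arabic"
    if has_latin:
        return "latin"
    return "other"  # digits, punctuation, emoji, and so on


def switch_points(text):
    """Return (tags, indices of tokens where the script differs from the previous scripted token)."""
    tags = [tag_token(token) for token in text.split()]
    switches = []
    previous = None
    for index, tag in enumerate(tags):
        if tag not in ("arabic", "latin"):
            continue  # unscripted or mixed tokens do not count as a switch
        if previous is not None and tag != previous:
            switches.append(index)
        previous = tag
    return tags, switches


if __name__ == "__main__":
    tags, switches = switch_points("انا رايح party بكرة")
    print(tags)      # one script label per token
    print(switches)  # token indices where the script changes
```

A baseline of this kind can only separate scripts. It cannot tell the Egyptian dialect, the Saudi dialect, and MSA apart, since all three use the same character set, which is consistent with the lower accuracy reported for the third experiment.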

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software


