Affiliation:
1. Ibn Zohr University
2. Ibn Zohr University, Ait Melloul
Abstract
Detecting Moroccan Arabic (Darija) in a multilingual environment remains a major challenge for automated language processing. This study addresses the detection of Moroccan Arabic within a binary dataset consisting of both Standard Arabic and Darija expressions. Our methodology combines a set of feature selection techniques with well-known feature extraction techniques from natural language processing (NLP), such as TF-IDF, CBOW, and Word2Vec. We apply machine learning techniques, namely LASSO, decision trees, and regression trees, and rely on semantic methods, mostly based on advanced encoders, to better capture what distinguishes the two varieties. We also examine statistical feature selection methods, such as ANOVA, the Pearson correlation coefficient, and mutual information, to add further layers of analysis. Finally, dimensionality reduction through principal component analysis (PCA) and singular value decomposition (SVD) preserves the essential structure of the high-dimensional linguistic space while contributing to the accurate detection of Moroccan Arabic. The most stable results are obtained with the XGBoost algorithm using classical extraction methods and SVD, with a reasonable execution time. This research aims both to reveal the specific subtleties of Arabic dialects and to open new directions for natural language processing in diverse, multilingual societies.
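As an illustration of the first extraction technique named in the abstract, the following is a minimal sketch of TF-IDF weighting in plain Python. It is not the study's implementation: the three sentences are a toy corpus invented for illustration (two Darija, one Standard Arabic), tokenization is a simple whitespace split, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

# Toy corpus, illustrative only (not taken from the study's dataset).
# Items 0 and 2 are Darija, item 1 is Standard Arabic.
docs = [
    "بغيت نمشي للدار",
    "أريد أن أذهب إلى المنزل",
    "واش بغيت تمشي معايا",
]

def tf_idf(corpus):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [doc.split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

vectors = tf_idf(docs)
```

Terms that occur in fewer documents receive a higher weight, so a word shared by both Darija sentences scores lower than one unique to a single sentence. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant of the IDF term by default, so their values will differ from this unsmoothed sketch.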
Publisher
Research Square Platform LLC