Affiliation:
1. Department of Computer Science, College of Computer Science and Engineering, Taibah University, Madinah 42353, Saudi Arabia
Abstract
Pretrained language models have achieved great success in various natural language understanding (NLU) tasks owing to their capacity to capture deep contextualized information in text through pretraining on large-scale corpora. Tokenization plays a significant role in lexical analysis, and the resulting tokens serve as the input for downstream natural language processing (NLP) tasks such as semantic parsing and language modeling. However, there is a lack of research evaluating the impact of tokenization on Arabic language models. Therefore, this study aims to address this gap in the literature by evaluating the performance of various tokenizers for Arabic large language models (LLMs). In this paper, we analyze the differences between the WordPiece, SentencePiece, and byte-level BPE (BBPE) tokenizers by pretraining three BERT models, one with each tokenizer, and measuring the performance of each model on seven NLP tasks across 29 datasets. Overall, the model pretrained on text tokenized with the SentencePiece tokenizer significantly outperforms the two models that use the WordPiece and BBPE tokenizers. The results of this paper will assist researchers in developing better models, selecting the most suitable tokenizers, improving feature engineering, and making models more efficient, ultimately leading to advancements in various NLP applications.
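For illustration, the sketch below shows one way the three tokenizer families compared in the abstract could be trained on a raw Arabic corpus using the Hugging Face tokenizers library; it is not the authors' code, and the corpus path, vocabulary size, and sample sentence are illustrative assumptions.

```python
# Minimal sketch: training WordPiece, byte-level BPE (BBPE), and a
# SentencePiece-style Unigram tokenizer on the same Arabic corpus.
# "arabic_corpus.txt" and vocab_size are assumed placeholders.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

corpus_files = ["arabic_corpus.txt"]  # assumed plain-text corpus, one document per line
vocab_size = 32000                    # assumed vocabulary size

# WordPiece (the tokenizer used by the original BERT)
wp = Tokenizer(models.WordPiece(unk_token="[UNK]"))
wp.pre_tokenizer = pre_tokenizers.Whitespace()
wp.train(corpus_files, trainers.WordPieceTrainer(
    vocab_size=vocab_size,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]))

# Byte-level BPE (BBPE), operating on UTF-8 bytes rather than characters
bbpe = Tokenizer(models.BPE())
bbpe.pre_tokenizer = pre_tokenizers.ByteLevel()
bbpe.train(corpus_files, trainers.BpeTrainer(vocab_size=vocab_size))

# SentencePiece-style Unigram model (SentencePiece can also be trained
# directly with the standalone sentencepiece package)
sp = Tokenizer(models.Unigram())
sp.pre_tokenizer = pre_tokenizers.Metaspace()
sp.train(corpus_files, trainers.UnigramTrainer(
    vocab_size=vocab_size, unk_token="[UNK]", special_tokens=["[UNK]"]))

# Inspect how each tokenizer segments the same Arabic sentence
sample = "اللغة العربية غنية بالصرف"
for name, tok in [("WordPiece", wp), ("BBPE", bbpe), ("SentencePiece", sp)]:
    print(name, tok.encode(sample).tokens)
```

Each trained tokenizer could then be used to preprocess the pretraining corpus for a separate BERT model, mirroring the comparison setup described in the abstract.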