Pre-Training MLM Using BERT for the Albanian Language

Authors:

Kryeziu Labehat1, Shehu Visar2

Affiliations:

1. Ph.D. Candidate, Faculty of Contemporary Sciences and Technologies, South East European University, North Macedonia

2. Full Professor, Faculty of Contemporary Sciences and Technologies, South East European University, North Macedonia

Abstract

Language is often used as a marker of human intelligence, and building systems that understand human language remains an ongoing challenge (Kryeziu & Shehu, 2022). Natural Language Processing is a very active field of study in which transformers play a key role. Transformers are based on neural networks and are increasingly showing promising results. One of the first major contributions to transfer learning in Natural Language Processing was the use of pre-trained word embeddings in 2010 (Joseph, Lev, & Yoshua, 2010). Pre-trained models such as ELMo (Matthew, et al., 2018) and BERT (Devlin, et al., 2019) are trained on large corpora of unlabeled text, and the resulting text representations have achieved good performance on many downstream tasks across datasets from different domains. Pre-training of language models has been shown to improve several aspects of natural language processing (Dai & Le, 2015). In this paper, we pre-train BERT on the Masked Language Modeling (MLM) task with an Albanian-language dataset (alb_dataset) that we created for this purpose (Kryeziu et al., 2022). We compare two approaches: training BERT on the publicly available OSCAR dataset and on our collected alb_dataset. The paper reports some discrepancies observed during training, especially when evaluating the performance of the models.
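For readers who want a concrete picture of the MLM pre-training setup summarized above, the sketch below shows how a BERT model can be pre-trained from scratch on a masked language modeling objective with the Hugging Face Transformers library. The corpus file name, tokenizer path, and hyperparameters are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of BERT-style MLM pre-training with Hugging Face Transformers.
# Corpus path, tokenizer path, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumed: a plain-text Albanian corpus with one document per line.
raw = load_dataset("text", data_files={"train": "alb_dataset.txt"})

# Assumed: a WordPiece tokenizer already trained on the same corpus.
tokenizer = BertTokenizerFast.from_pretrained("./albanian-tokenizer")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-base configuration, trained from scratch on the new vocabulary.
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

# The collator randomly masks 15% of tokens, as in the original BERT MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="./bert-albanian-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized,
).train()
```

Under this setup, comparing corpora (e.g. alb_dataset versus OSCAR) would in principle only require swapping the training file while keeping the tokenizer and model configuration fixed; the paper's actual experimental protocol may differ.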

Publisher

Walter de Gruyter GmbH

References (28 articles)

1. Abdelali, A., Hassan, S., & Mubarak, H. (2021). Pre-Training BERT on Arabic Tweets: Practical Considerations. arXiv. Qatar Computing Research Institute, Doha, Qatar.

2. Alsentzer, E., Murphy, J., Boag, W., Weng, W.-H., Jindi, D., Naumann, T., & McDermott, M. (2019). Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop (pp. 72–78).

3. Canete, J., Chaperon, G., & Fuentes, R. (2019). Spanish pre-trained BERT model and evaluation data. PML4DC at ICLR.

4. Cui, Y., Che, W., Liu, T., Qin, B., Yang, Z., Wang, S., & Hu, G. (2019). Pre-Training with Whole Word Masking for Chinese BERT.

5. Dai, A., & Le, Q. (2015). Semi-supervised sequence learning. In Advances in Neural Information Processing Systems (pp. 3079–3087).
