Pre-trained transformer-based language models for Sundanese

Published:2022-04-13 Issue:1 Volume:9
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Wongso Wilson, Lucky Henry, Suhartono Derwin

Abstract

The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. As with other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and did not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

https://link.springer.com/content/pdf/10.1186/s40537-022-00590-7.pdf

References: 47 articles.

1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. arXiv preprint arXiv:1706.03762. 2017.

2. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533–6.

3. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.

4. Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. 2014.

5. Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. 2018.

Cited by 6 articles.

1. Deep context transformer: bridging efficiency and contextual understanding of transformer models;Applied Intelligence;2024-07-06

2. Next-Gen Language Mastery: Exploring Advances in Natural Language Processing Post-transformers;Lecture Notes in Networks and Systems;2024

3. Unveiling Sentiments in Javanese Text: A Study on Sentiment Analysis for the Javanese Language;2023 IEEE 9th Information Technology International Seminar (ITIS);2023-10-18

4. Indonesian-Kailinese Machine Translation;2023 International Conference on Data Science and Its Applications (ICoDSA);2023-08-09

5. Sentiment Analysis on Indonesian-Sundanese Code-Mixed Data;2023 IEEE 8th International Conference for Convergence in Technology (I2CT);2023-04-07