Abstract
Thai sentence segmentation has long been a topic of interest in the Thai NLP community. However, little work has explored transformer-based large language models for this task. We conduct three experiments on the LST20 corpus: (1) fine-tuning WangchanBERTa, a large language model pre-trained on Thai, across different classification tasks, (2) joint learning for clause and sentence segmentation, and (3) cross-lingual transfer using the multilingual model XLM-RoBERTa. Our findings show that WangchanBERTa outperforms other models on Thai sentence segmentation, and that fine-tuning it with token and contextual information further improves its performance. However, cross-lingual transfer from English and Chinese to Thai is not effective for this task.
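The abstract frames sentence segmentation as a classification problem over a fine-tuned Thai transformer. The following is a minimal sketch, not the authors' released code, of how such a setup typically looks with the HuggingFace transformers library; the checkpoint name (the public airesearch/wangchanberta-base-att-spm-uncased model on the Hub) and the two-label boundary scheme are assumptions for illustration.

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline):
# treat Thai sentence segmentation as per-token classification,
# where each subword is labeled 1 if it begins a sentence, else 0.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "airesearch/wangchanberta-base-att-spm-uncased"  # assumed public checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,  # toy label scheme: sentence-boundary vs. non-boundary
)

text = "..."  # a Thai passage to segment
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)
preds = logits.argmax(dim=-1)  # predicted boundary label per subword
```

Fine-tuning this head on LST20 boundary labels (and, for the joint-learning experiment, a second head for clause boundaries) would follow the standard token-classification training loop.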
Publisher
Office of Academic Resources, Chulalongkorn University
Cited by
4 articles.
1. Scaling Dual Stage Image-Text Retrieval with Multimodal Large Language Models;2024 International Joint Conference on Neural Networks (IJCNN);2024-06-30
2. EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning;IEEE/ACM Transactions on Audio, Speech, and Language Processing;2024
3. ASAGeR: Automated Short Answer Grading Regressor via Sentence Simplification;2023 IEEE International Conference on Knowledge Graph (ICKG);2023-12-01
4. Uncertainty Estimation for Complex Text Detection in Spanish;2023 IEEE 5th International Conference on BioInspired Processing (BIP);2023-11-28