Tibetan Sentence Boundaries Automatic Disambiguation Based on Bidirectional Encoder Representations from Transformers on Byte Pair Encoding Word Cutting Method

Published: 2024-04-02
Volume: 14, Issue: 7, Page: 2989
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Li Fenfang¹, Zhao Zhengzhang¹, Wang Li¹, Deng Han¹

Affiliation:

1. College of Computer Science & Engineering, Northwest Normal University, Lanzhou 730070, China

Abstract

Sentence Boundary Disambiguation (SBD) is crucial for building datasets for tasks such as machine translation, syntactic analysis, and semantic analysis. Most current approaches to automatic Tibetan sentence segmentation are rule-based, statistical, or a combination of the two; they place high demands on the corpus and on the researchers' linguistic expertise, and they require costly manual annotation. In this study, we explore Tibetan SBD using deep learning. First, we analyze the characteristics of Tibetan and several subword techniques, select Byte Pair Encoding (BPE) and SentencePiece (SP) for text segmentation, and train a Bidirectional Encoder Representations from Transformers (BERT) pre-trained language model with each. Second, we study Tibetan SBD with these BERT pre-trained language models: the model learns the ambiguity of the shad (“།”) at different positions in modern Tibetan texts and determines whether a given shad functions as a sentence boundary. For performance comparison, we also introduce four BERT-based models: BERT-CNN, BERT-RNN, BERT-RCNN, and BERT-DPCNN. Finally, to verify the performance of pre-trained language models on the SBD task, we conduct SBD experiments on both the publicly available Tibetan pre-trained language model TiBERT and the multilingual pre-trained language model Multi-BERT. The experimental results show that the F1 score of the BERT (BPE) model trained in this study reaches 95.32% on 465,669 Tibetan sentences, nearly five percentage points higher than BERT (SP) and Multi-BERT. The SBD method based on pre-trained language models lays the foundation for building datasets for later Tibetan pre-training, summary extraction, and machine translation tasks.
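The task framing in the abstract, deciding for each shad whether it ends a sentence, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed character window, and the `is_boundary` callable are assumptions made for illustration. In the study, that decision would come from a fine-tuned BERT classifier working on subword tokens rather than raw characters.

```python
SHAD = "\u0F0D"  # TIBETAN MARK SHAD, the "།" character discussed in the abstract


def shad_candidates(text, window=10):
    """Return one (position, left_context, right_context) tuple per shad.

    The character window is a hypothetical stand-in for the model input.
    """
    candidates = []
    for i, ch in enumerate(text):
        if ch == SHAD:
            left = text[max(0, i - window):i]
            right = text[i + 1:i + 1 + window]
            candidates.append((i, left, right))
    return candidates


def split_sentences(text, is_boundary, window=10):
    """Cut text after every shad that is_boundary(left, right) accepts.

    is_boundary is a placeholder for a trained binary classifier.
    """
    sentences, start = [], 0
    for i, left, right in shad_candidates(text, window):
        if is_boundary(left, right):
            sentences.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        sentences.append(text[start:])  # trailing text without a final boundary
    return sentences
```

Any function that takes the left and right context and returns a boolean can be passed as `is_boundary`, so a rule-based baseline and a neural classifier can share the same splitting logic. The sketch treats every shad independently and does not handle a double shad or other Tibetan punctuation.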

Funder

National Natural Science Foundation of China

Major science and technology projects of Gansu province

Science and Technology Commissioner Special Project of Gansu province

Gansu Provincial Department of Education: Industry Support Program Project

Northwest Normal University Young Teachers Research Ability Enhancement Program Project

Publisher

MDPI AG

Link

https://www.mdpi.com/2076-3417/14/7/2989/pdf
