Affiliation:
1. College of Computer Science & Engineering, Northwest Normal University, Lanzhou 730070, China
Abstract
Sentence Boundary Disambiguation (SBD) is crucial for building datasets for tasks such as machine translation, syntactic analysis, and semantic analysis. Currently, most automatic sentence segmentation for Tibetan relies on rule-based methods, statistical learning, or a combination of the two. These approaches place high demands on the corpus and on researchers' linguistic expertise, and manual annotation is costly. In this study, we explore Tibetan SBD using deep learning. First, we analyze the characteristics of Tibetan and various subword techniques, select Byte Pair Encoding (BPE) and SentencePiece (SP) for text segmentation, and train Bidirectional Encoder Representations from Transformers (BERT) pre-trained language models. Second, we study Tibetan SBD based on the different BERT pre-trained language models: the models learn the ambiguity of the shad (“།”) at different positions in modern Tibetan texts and determine whether a given shad functions as a sentence boundary. We also introduce four BERT-based models, BERT-CNN, BERT-RNN, BERT-RCNN, and BERT-DPCNN, for performance comparison. Finally, to verify the performance of pre-trained language models on the SBD task, we conduct SBD experiments with the publicly available Tibetan pre-trained language model TiBERT and the multilingual pre-trained language model Multi-BERT. The experimental results show that the BERT (BPE) model trained in this study reaches an F1 score of 95.32% on 465,669 Tibetan sentences, nearly five percentage points higher than BERT (SP) and Multi-BERT. The SBD method based on pre-trained language models presented here lays a foundation for building datasets for later Tibetan tasks such as pre-training, summary extraction, and machine translation.
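The task framing in the abstract, deciding for each shad whether it ends a sentence, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the context window size, and the `is_boundary` callback (a stand-in for a fine-tuned BERT classifier) are all assumptions made for illustration.

```python
SHAD = "\u0f0d"  # Tibetan mark shad "།"


def shad_candidates(text, window=20):
    """Return one (position, left_context, right_context) tuple per shad.

    The window size is an illustrative assumption, not a value from the paper.
    """
    candidates = []
    for pos, ch in enumerate(text):
        if ch == SHAD:
            left = text[max(0, pos - window):pos]
            right = text[pos + 1:pos + 1 + window]
            candidates.append((pos, left, right))
    return candidates


def split_sentences(text, is_boundary, window=20):
    """Split text after every shad that is_boundary(left, right) accepts.

    is_boundary is a hypothetical placeholder for a trained classifier,
    e.g. a BERT model scoring the context around the shad.
    """
    sentences, start = [], 0
    for pos, left, right in shad_candidates(text, window):
        if is_boundary(left, right):
            sentences.append(text[start:pos + 1])
            start = pos + 1
    if start < len(text):
        sentences.append(text[start:])  # trailing text without a final boundary
    return sentences
```

In this sketch every shad is a candidate, and only the classifier's decision turns a candidate into a split point, which mirrors the disambiguation problem described above.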
Funder
National Natural Science Foundation of China
Major Science and Technology Projects of Gansu Province
Science and Technology Commissioner Special Project of Gansu Province
Gansu Provincial Department of Education: Industry Support Program Project
Northwest Normal University Young Teachers Research Ability Enhancement Program Project