Affiliation:
1. The Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences, Beijing, P. R. China
Abstract
Ancient Tibetan literature is an important source for the study of ancient Tibetan culture and history and of the development of the Sino-Tibetan language family. However, the lack of word segmentation tools for ancient Tibetan seriously restricts research on this literature. To address this, this paper first extracts an ancient Tibetan word segmentation dataset from ancient Tibetan interlinear annotation data. Based on this dataset, we conduct extensive experiments on the task of ancient Tibetan word segmentation. The experimental results show that the BiLSTM + CRF word segmentation algorithm achieves the best performance, and that performance can be further improved through model ensembling. The results also show that unknown words, insufficient training data, and word ambiguity limit the performance of ancient Tibetan word segmentation.
Publisher
World Scientific Pub Co Pte Ltd
Subject
General Earth and Planetary Sciences, General Engineering, General Environmental Science