A Statistical Language Model for Pre-Trained Sequence Labeling: A Case Study on Vietnamese

Published:2022-05-31 Issue:3 Volume:21 Page:1-21
ISSN:2375-4699
Container-title:ACM Transactions on Asian and Low-Resource Language Information Processing
language:en
Short-container-title:ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Liao Xianwen¹, Huang Yongzhong¹, Yang Peng¹, Chen Lei¹

Affiliation:

1. Guilin University of Electronic Technology, Guilin, Guangxi, China

Abstract

By defining a computable word segmentation unit and studying its probability characteristics, this article establishes an unsupervised statistical language model (SLM) for a new pre-trained sequence labeling framework. The proposed SLM is an optimization model whose objective is to maximize the total binding force of all candidate word segmentation units in a sentence, without annotated datasets or vocabularies. To solve the SLM, we design a recursive divide-and-conquer dynamic programming algorithm. We integrate the SLM with popular sequence labeling models and run Vietnamese word segmentation, part-of-speech tagging, and named entity recognition experiments. The results show that the SLM effectively improves performance on these sequence labeling tasks. Using less than 10% of the training data and no dictionary, our sequence labeling framework outperforms the state-of-the-art Vietnamese word segmentation toolkit VnCoreNLP on the cross-dataset test. The SLM has no hyper-parameters to tune, is completely unsupervised, and is applicable to any other analytic language; it therefore has good domain adaptability.
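The optimization described in the abstract (choose the segmentation whose units have the largest total score) can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not define the binding force, and the authors use a recursive divide-and-conquer dynamic program, whereas the sketch below uses an ordinary left-to-right dynamic program. The function names `segment` and `toy_score` and the table `TOY_FORCE` are hypothetical, and the scores are made up for illustration only.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Unit = Tuple[str, ...]


def segment(
    syllables: Sequence[str],
    score: Callable[[Unit], float],
) -> Tuple[List[Unit], float]:
    """Split `syllables` into contiguous units that maximize the summed score.

    best[i] holds the best total score for the first i syllables;
    back[i] holds the start index of the last unit in that best split.
    """
    n = len(syllables)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            candidate = best[start] + score(tuple(syllables[start:end]))
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Walk the back-pointers from the end to recover the chosen units.
    units: List[Unit] = []
    end = n
    while end > 0:
        start = back[end]
        units.append(tuple(syllables[start:end]))
        end = start
    units.reverse()
    return units, best[n]


# Assumption: a made-up score table standing in for the paper's binding force.
TOY_FORCE: Dict[Unit, float] = {("học", "sinh"): 2.0}


def toy_score(unit: Unit) -> float:
    """Known units get their table score; unknown multi-syllable units are penalized."""
    if unit in TOY_FORCE:
        return TOY_FORCE[unit]
    return 0.0 if len(unit) == 1 else -1.0


units, total = segment(["học", "sinh", "đi", "học"], toy_score)
print(units, total)
```

With these toy scores the sketch groups "học sinh" into one unit and leaves the other syllables as single-syllable units. In the paper's setting the scores would instead come from statistics of unlabeled text, which this sketch does not attempt to reproduce.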

Funder

National Natural Science Foundation of China

Basic and Applied Basic Research Fund of Guangdong Province, China

Guangxi Science and Technology Plan Project

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3483524
