Visual and Phonological Feature Enhanced Siamese BERT for Chinese Spelling Error Correction

Published: 2022-04-30
Volume: 12, Issue: 9, Page: 4578
ISSN: 2076-3417
Journal: Applied Sciences
Language: en

Author:

Liu Yujia, Guo Hongliang, Wang Shuai, Wang Tiejun

Abstract

Chinese Spelling Check (CSC) aims to detect and correct spelling errors in Chinese. Most CSC models rely on human-defined confusion sets to narrow the search space, and therefore fail to resolve errors outside the confusion set. Moreover, most spelling errors in current benchmark datasets are character pairs with similar pronunciations; errors between similarly shaped characters, and errors that are neither visually nor phonologically related, are not considered. Furthermore, the automatically generated training data widely used in CSC tasks leads to label leakage and unfair comparisons between methods. In this work, we propose a siamese BERT enhanced with visual and phonological features to (1) correct spelling errors without using confusion sets; (2) integrate phonological and visual features for CSC through a glyph graph; (3) improve performance on unseen spelling errors. To evaluate CSC methods fairly and comprehensively, we build a large-scale CSC dataset in which every error type has the same number of samples. The experimental results show that the proposed approach outperforms previous state-of-the-art methods on three benchmark datasets and on the new error-type-balanced dataset.
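
The abstract describes a dataset in which every error type has the same number of samples, but it does not say how that balance was obtained. As a minimal sketch only, one simple way to reach such a balance is to downsample each error type to the size of the smallest one. The function name `balance_by_error_type`, the `(sentence, error_type)` pair layout, and the error-type labels below are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def balance_by_error_type(samples, seed=0):
    """Downsample so that every error type has the same number of samples.

    `samples` is a list of (sentence, error_type) pairs. This layout and the
    label names are assumptions for illustration, not the paper's format.
    """
    # Group the samples by their error type.
    groups = defaultdict(list)
    for sentence, error_type in samples:
        groups[error_type].append((sentence, error_type))
    if not groups:
        return []

    # Every type is cut down to the size of the smallest group.
    target = min(len(group) for group in groups.values())

    # A seeded generator keeps the selection reproducible.
    rng = random.Random(seed)
    balanced = []
    for error_type in sorted(groups):
        balanced.extend(rng.sample(groups[error_type], target))
    return balanced


# Hypothetical toy data: 5 phonological, 3 visual, 2 unrelated errors.
toy_samples = (
    [(f"p{i}", "phonological") for i in range(5)]
    + [(f"v{i}", "visual") for i in range(3)]
    + [(f"o{i}", "other") for i in range(2)]
)
balanced_samples = balance_by_error_type(toy_samples)
```

Downsampling discards data from the larger groups; generating extra samples for the rarer error types would be the alternative, and the abstract does not indicate which route the authors took.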

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/12/9/4578/pdf
