SEN: A subword-based ensemble network for Chinese historical entity extraction

Published:2022-12-22 Page:1-23
ISSN:1351-3249
Container-title:Natural Language Engineering
language:en
Short-container-title:Nat. Lang. Eng.

Author:

Yan Chengxi,Wang Ruojia,Fang Xiaoke

Abstract

Understanding historical entity information (e.g., persons, locations, and time) plays an important role in reasoning about the development of historical events. With growing interest in digital humanities and natural language processing, named entity recognition (NER) provides a feasible solution for automatically extracting these entities from historical texts, especially in Chinese historical research. However, previous approaches are domain-specific, achieve relatively low accuracy, and are non-interpretable, which hinders the development of NER in Chinese history. In this paper, we propose a new hybrid deep learning model called “subword-based ensemble network” (SEN), which incorporates subword information and a novel attention fusion mechanism. Experiments on CMAG, a massive self-built Chinese historical corpus, show that SEN achieves the best performance among the advanced models compared, with 93.87% F1-micro and 89.70% F1-macro. Further investigation reveals that SEN has a strong generalization ability for NER on Chinese historical texts: it is not only relatively insensitive to categories with fewer annotation labels (e.g., OFI) but can also accurately capture diverse local and global semantic relations. Our research demonstrates the effectiveness of integrating subword information and attention fusion, which provides an inspiring solution for the practical use of entity extraction in the Chinese historical domain.
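The abstract reports both F1-micro and F1-macro. As a minimal sketch of how the two averages differ, the snippet below computes each from per-category counts. The function names and the counts are assumptions made for illustration; they are not taken from the paper or from the CMAG corpus. Only the label OFI is mentioned in the abstract, and PER and LOC are hypothetical label names.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one entity category
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when all three counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_class: Dict[str, Counts]) -> float:
    """Pool the counts over all categories, then compute a single F1."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return f1(tp, fp, fn)


def macro_f1(per_class: Dict[str, Counts]) -> float:
    """Compute F1 per category, then take the unweighted mean."""
    if not per_class:
        return 0.0
    return sum(f1(*c) for c in per_class.values()) / len(per_class)


# Illustrative counts only (NOT results from the paper).
# OFI stands in for a category with few annotated examples.
counts = {
    "PER": (900, 50, 50),
    "LOC": (450, 25, 25),
    "OFI": (30, 20, 20),
}

print(f"micro-F1 = {micro_f1(counts):.4f}")
print(f"macro-F1 = {macro_f1(counts):.4f}")
```

With these made-up counts the micro average comes out higher than the macro average. Pooling lets the frequent categories dominate, while the unweighted mean gives the rare category equal weight, which is one reason both figures are commonly reported for NER.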

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence,Linguistics and Language,Language and Linguistics,Software

Reference: 51 articles.

1. The Viterbi algorithm

2. Li, L., Mao, T., Huang, D. and Yang, Y. (2006). Hybrid models for Chinese named entity recognition. In Proceedings of the 5th SIGHAN Workshop on Chinese Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 72–78.

3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems. NeurIPS, pp. 5998–6008.

4. Leong, K.S., Wong, F., Li, Y. and Dong, M.C. (2008). Chinese tagging based on maximum entropy model. In Proceedings of the 6th SIGHAN Workshop on Chinese Language Processing, pp. 138–142.

5. Chen, A., Peng, F., Shan, R. and Sun, G. (2006). Chinese named entity recognition with conditional probabilistic models. In Proceedings of the 5th SIGHAN Workshop on Chinese Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 173–176.