A cyclic self-learning Chinese word segmentation for the geoscience domain

Published:2018-03-01 Issue:1 Volume:72 Page:16-26
ISSN:1195-1036
Container-title:Geomatica
language:en
Short-container-title:Geomatica

Author:

Qiu Qinjun¹,², Xie Zhong¹,², Wu Liang¹,²

Affiliation:

1. Department of Information Engineering, China University of Geosciences, Wuhan 430074, China.

2. National Engineering Research Center of Geographic Information System, Wuhan 430074, China.

Abstract

Unlike English and other Western languages, Chinese does not delimit words with white space, so Chinese Word Segmentation (CWS) is a crucial first step in natural language processing. For the geoscience domain, however, CWS remains an unresolved problem with many challenges. Traditional methods can be applied to geoscience documents, but they lack the domain knowledge needed to handle large volumes of such documents. These challenges motivated us to build a segmenter specifically for the geoscience domain. Most current state-of-the-art methods for CWS are based on supervised learning, with features extracted mostly from a local context. In this paper, we propose a sequence-learning framework that incorporates cyclic self-learning corpus training. Following this framework, we build GeoSegmenter, based on the Bi-directional Long Short-Term Memory (Bi-LSTM) network model, to perform Chinese word segmentation. The model gains a substantial advantage from repeated iterations over the training data. Experimental results on geoscience documents and benchmark datasets show that GeoSegmenter can segment geological documents and can also handle generic documents.
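
Because Chinese text carries no word delimiters, sequence-learning segmenters usually recast segmentation as assigning one label to each character. The sketch below illustrates that idea only, assuming the widely used BMES tag set (Begin, Middle, End, Single); the abstract does not state which tag set GeoSegmenter uses, and the function names and example sentence are hypothetical.

```python
def words_to_tags(words):
    """Convert a segmented sentence into one BMES tag per character.

    Assumption: BMES tagging; the abstract does not name a tag set.
    """
    tags = []
    for word in words:
        if not word:
            continue  # skip empty tokens
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, everything between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


# Hypothetical example: "granite / is / igneous rock"
segmented = ["花岗岩", "是", "火成岩"]
tags = words_to_tags(segmented)
print(tags)
print(tags_to_words("".join(segmented), tags))
```

Under this formulation, a tagger such as a Bi-LSTM would predict the tag sequence from the raw characters, and `tags_to_words` would turn its predictions back into words.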

Publisher

Canadian Science Publishing

Subject

Earth-Surface Processes; Geography, Planning and Development

Link

http://www.nrcresearchpress.com/doi/pdf/10.1139/geomat-2018-0007

References: 12 articles.

1. Learning long-term dependencies with gradient descent is difficult

2. Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach

3. Framewise phoneme classification with bidirectional LSTM and other neural network architectures

4. Long Short-Term Memory

Cited by 25 articles.

1. Constraint information extraction for 3D geological modelling using a span-based joint entity and relation extraction model;Earth Science Informatics;2024-02-16

2. Semantic information extraction and search of mineral exploration data using text mining and deep learning methods;Ore Geology Reviews;2024-02

3. A deep learning-based method for deep information extraction from multimodal data for geological reports to support geological knowledge graph construction;Earth Science Informatics;2024-01-08

4. Research on Chinese Word Segmentation Algorithm in the Tobacco Field Based on the BERT-BiLSTM-CRF Model;Lecture Notes in Electrical Engineering;2024

5. Integrated Geologic Terms and Dual Model for Chinese Geological Word Segmentation;Lecture Notes in Computer Science;2024