Data set entity recognition based on distant supervision

Published:2021-07-26 Issue:3 Volume:39 Page:435-449
ISSN:0264-0473
Container-title:The Electronic Library
language:en
Short-container-title:EL

Author:

Li Pengcheng,Liu Qikai,Cheng Qikai,Lu Wei

Abstract

Purpose: This paper aims to identify data set entities in scientific literature. To address the poor recognition caused by a lack of training corpora in existing studies, a distant supervised learning-based approach is proposed to identify data set entities automatically from large-scale scientific literature in an open domain.

Design/methodology/approach: Firstly, the authors use a dictionary combined with a bootstrapping strategy to create a labelled corpus to which supervised learning can be applied. Secondly, a Bidirectional Encoder Representations from Transformers (BERT)-based neural model is applied to identify data set entities in the scientific literature automatically. Finally, two data augmentation techniques, entity replacement and entity masking, are introduced to enhance the model's generalisability and improve the recognition of data set entities.

Findings: In the absence of manually labelled training data, the proposed method can effectively identify data set entities in large-scale scientific papers. The BERT-based vectorised representation and the data augmentation techniques enable significant improvements in the generality and robustness of named entity recognition models, especially in long-tailed data set entity recognition.

Originality/value: This paper provides a practical research method for automatically recognising data set entities in scientific literature. To the best of the authors’ knowledge, this is the first attempt to apply distant supervision to the study of data set entity recognition. The authors introduce a robust vectorised representation and two data augmentation strategies (entity replacement and entity masking) to address the problem inherent in distant supervised learning methods, which the existing research has mostly ignored. The experimental results demonstrate that the proposed approach effectively improves the recognition of data set entities, especially long-tailed data set entities.
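The dictionary-based labelling step and the two augmentation techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no details, so the BIO tag scheme, the `DATASET` label, the `[MASK]` placeholder, the longest-match lookup and all function names below are assumptions chosen for illustration. The bootstrapping loop and the BERT-based tagger are omitted.

```python
import random

# Hypothetical dictionary of known data set names, stored as token tuples.
DICTIONARY = {("ImageNet",), ("Penn", "Treebank")}


def distant_label(tokens, dictionary):
    """Assign BIO tags by longest-match lookup against the dictionary."""
    tags = ["O"] * len(tokens)
    max_len = max((len(name) for name in dictionary), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in dictionary:
                tags[i] = "B-DATASET"
                for j in range(i + 1, i + n):
                    tags[j] = "I-DATASET"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


def entity_spans(tags):
    """Return (start, end) index pairs, end exclusive, for each tagged entity."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-DATASET":
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag != "I-DATASET" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def replace_entities(tokens, tags, dictionary, rng):
    """Entity replacement: swap each entity for a randomly drawn dictionary name."""
    names = sorted(dictionary)
    out_tokens, out_tags, prev = [], [], 0
    for start, end in entity_spans(tags):
        out_tokens += tokens[prev:start]
        out_tags += tags[prev:start]
        new_name = rng.choice(names)
        out_tokens += list(new_name)
        out_tags += ["B-DATASET"] + ["I-DATASET"] * (len(new_name) - 1)
        prev = end
    out_tokens += tokens[prev:]
    out_tags += tags[prev:]
    return out_tokens, out_tags


def mask_entities(tokens, tags, mask_token="[MASK]"):
    """Entity masking: hide entity tokens so a model must rely on context."""
    masked = [mask_token if tag != "O" else tok for tok, tag in zip(tokens, tags)]
    return masked, list(tags)


tokens = "We evaluate on Penn Treebank and ImageNet .".split()
tags = distant_label(tokens, DICTIONARY)
masked_tokens, masked_tags = mask_entities(tokens, tags)
swapped_tokens, swapped_tags = replace_entities(tokens, tags, DICTIONARY, random.Random(0))
```

In this sketch, replacement keeps the sentence context while varying the entity surface form, and masking removes the surface form altogether; both are plausible ways to discourage a tagger from memorising dictionary names, which is how the abstract motivates them for long-tailed entities.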

Publisher

Emerald

Subject

Library and Information Sciences,Computer Science Applications

References (35 articles)

1. Pooled contextualized embeddings for named entity recognition,2019

2. Distant supervision for silver label generation of software mentions in social scientific publications,2019

3. BERT: pre-training of deep bidirectional transformers for language understanding,2018

4. Dong, X., Qian, L., Guan, Y., Huang, L., Yu, Q. and Yang, J. (2016), “A multiclass classification method based on deep learning for named entity recognition in electronic medical records”, paper presented at the New York, NY Scientific Data Summit (NYSDS ’16), IEEE.

5. Ambiguity and variability of database and software names in bioinformatics;Journal of Biomedical Semantics,2015

Cited by 4 articles.

1. Low-resource multi-granularity academic function recognition based on multiple prompt knowledge;The Electronic Library;2024-08-22

2. A term function–aware keyword citation network method for science mapping analysis;Information Processing & Management;2023-07

3. From “what” to “how”: Extracting the Procedural Scientific Information Toward the Metric-optimization in AI;Information Processing & Management;2023-05

4. Exploring developments of the AI field from the perspective of methods, datasets, and metrics;Information Processing & Management;2023-03