Biomedical named entity recognition and linking datasets: survey and our recent development

Author:

Huang Ming-Siang1,Lai Po-Ting2,Lin Pei-Yen3,You Yu-Ting4,Tsai Richard Tzong-Han5,Hsu Wen-Lian4

Affiliation:

1. Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei, Taiwan

2. Institute of Biomedical Informatics, National Yang Ming University, Taipei, Taiwan

3. Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan

4. Intelligent Agent Systems Laboratory, Institute of Information Science, Academia Sinica, Taipei, Taiwan

5. Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan

Abstract

AbstractNatural language processing (NLP) is widely applied in biological domains to retrieve information from publications. Systems to address numerous applications exist, such as biomedical named entity recognition (BNER), named entity normalization (NEN) and protein–protein interaction extraction (PPIE). High-quality datasets can assist the development of robust and reliable systems; however, due to the endless applications and evolving techniques, the annotations of benchmark datasets may become outdated and inappropriate. In this study, we first review commonlyused BNER datasets and their potential annotation problems such as inconsistency and low portability. Then, we introduce a revised version of the JNLPBA dataset that solves potential problems in the original and use state-of-the-art named entity recognition systems to evaluate its portability to different kinds of biomedical literature, including protein–protein interaction and biology events. Lastly, we introduce an ensembled biomedical entity dataset (EBED) by extending the revised JNLPBA dataset with PubMed Central full-text paragraphs, figure captions and patent abstracts. This EBED is a multi-task dataset that covers annotations including gene, disease and chemical entities. In total, it contains 85000 entity mentions, 25000 entity mentions with database identifiers and 5000 attribute tags. To demonstrate the usage of the EBED, we review the BNER track from the AI CUP Biomedical Paper Analysis challenge. Availability: The revised JNLPBA dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/Re vised_JNLPBA.zip. The EBED dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/AICUP _EBED_dataset.rar. Contact: Email: thtsai@g.ncu.edu.tw, Tel. 886-3-4227151 ext. 35203, Fax: 886-3-422-2681 Email: hsu@iis.sinica.edu.tw, Tel. 886-2-2788-3799 ext. 2211, Fax: 886-2-2782-4814 Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.

Funder

Ministry of Education, Taiwan

Ministry of Science and Technology, Taiwan

Publisher

Oxford University Press (OUP)

Subject

Molecular Biology,Information Systems

Reference117 articles.

1. The rate of growth in scientific publication and the decline in coverage provided by science citation index;Larsen;Scientometrics,2010

2. Growth rates of modern science: a bibliometric analysis based on the number of publications and cited references;Bornmann;J Assoc Inf Sci Technol,2015

3. Literature mining for the biologist: from information retrieval to biological discovery;Jensen,2006

4. The impact of named entity normalization on information retrieval for question answering;Khalid,2008

5. Overview of BioNLP shared task 2013;Nédellec;Proceedings of the BioNLP Shared Task 2013 Workshop

Cited by 35 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3