A new approach and gold standard toward author disambiguation in MEDLINE

Author:

Vishnyakova Dina1,Rodriguez-Esteban Raul1,Rinaldi Fabio234

Affiliation:

1. Roche Pharmaceutical Research and Early Development, pRED Informatics, Roche Innovation Center, Basel, Switzerland

2. Institute of Computational Linguistics, University of Zurich, Switzerland

3. Swiss Institute of Bioinformatics, Zurich, Switzerland

4. Dalle Molle Institute for Artificial Intelligence Research (IDSIA), Manno, Switzerland

Abstract

Abstract Objective Author-centric analyses of fast-growing biomedical reference databases are challenging due to author ambiguity. This problem has been mainly addressed through author disambiguation using supervised machine-learning algorithms. Such algorithms, however, require adequately designed gold standards that reflect the reference database properly. In this study we used MEDLINE to build the first unbiased gold standard in a reference database and improve over the existing state of the art in author disambiguation. Materials and Methods Following a new corpus design method, publication pairs randomly picked from MEDLINE were evaluated by both crowdsourcing and expert curators. Because the latter showed higher accuracy than crowdsourcing, expert curators were tasked to create a full corpus. The corpus was then used to explore new features that could improve state-of-the-art author disambiguation algorithms that would not have been discoverable with previously existing gold standards. Results We created a gold standard based on 1900 publication pairs that shows close similarity to MEDLINE in terms of chronological distribution and information completeness. A machine-learning algorithm that includes new features related to the ethnic origin of authors showed significant improvements over the current state of the art and demonstrates the necessity of realistic gold standards to further develop effective author disambiguation algorithms. Discussion and Conclusion An unbiased gold standard can give a more accurate picture of the status of author disambiguation research and help in the discovery of new features for machine learning. The principles and methods shown here can be applied to other reference databases beyond MEDLINE. The gold standard and code used for this study are available at the following repository: https://github.com/amorgani/AND/

Funder

Roche Postdoctoral Fellowship

Publisher

Oxford University Press (OUP)

Subject

Health Informatics

Reference27 articles.

Cited by 14 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

1. Bridging the gap in author names: building an enhanced author name dataset for biomedical literature system;Journal of the American Medical Informatics Association;2024-06-25

2. Automatic author name disambiguation by differentiable feature selection;Journal of Information Science;2023-09-19

3. Web-Scale Academic Name Disambiguation: The WhoIsWho Benchmark, Leaderboard, and Toolkit;Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2023-08-04

4. Notes on the data quality of bibliographic records from the MEDLINE database;Database;2023-01-01

5. LAGOS‐AND : A large gold standard dataset for scholarly author name disambiguation;Journal of the Association for Information Science and Technology;2022-11-28

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3