FastHPOCR: pragmatic, fast, and accurate concept recognition using the human phenotype ontology

Author:

Tudor Groza (1, 2, 3, 4), Dylan Gration (5), Gareth Baynam (1, 2, 5, 6), Peter N. Robinson (7, 8)

Affiliation:

1. Rare Care Centre, Perth Children’s Hospital, Nedlands, WA 6009, Australia

2. Telethon Kids Institute, Nedlands, WA 6009, Australia

3. School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, Bentley, WA 6102, Australia

4. SingHealth Duke-NUS Institute of Precision Medicine, Singapore 169609, Singapore

5. Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, Subiaco, WA 6008, Australia

6. Faculty of Health and Medical Sciences, University of Western Australia, Crawley, WA 6009, Australia

7. Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany

8. The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, United States

Abstract

Motivation: Human Phenotype Ontology (HPO)-based phenotype concept recognition (CR) underpins a faster and more effective mechanism to create patient phenotype profiles or to document novel phenotype-centred knowledge statements. While the increasing adoption of large language models (LLMs) for natural language understanding has led to several LLM-based solutions, we argue that their intrinsically resource-intensive nature makes them unsuitable for realistic management of the phenotype CR lifecycle. Consequently, we propose to go back to basics and adopt a dictionary-based approach that enables both an immediate refresh of the ontological concepts and efficient re-analysis of past data.

Results: We developed a dictionary-based approach that uses a large, pre-built collection of clusters of morphologically equivalent tokens to address lexical variability, and that makes the CR step more effective by restricting entity boundary detection strictly to candidates consisting of tokens that belong to ontology concepts. Our method achieves state-of-the-art results (0.76 F1 on the GSC+ corpus) and processes 10 000 publication abstracts in 5 s.

Availability and implementation: FastHPOCR is available as a Python package installable via pip. The source code is available at https://github.com/tudorgroza/fast_hpo_cr. A Java implementation of FastHPOCR will be made available as part of the Fenominal Java library, available at https://github.com/monarch-initiative/fenominal. The up-to-date GSC-2024 corpus is available at https://github.com/tudorgroza/code-for-papers/tree/main/gsc-2024.
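The two ideas named in the Results (token clusters for lexical variability, and candidate boundaries limited to ontology tokens) can be illustrated with a minimal Python sketch. This is not the FastHPOCR implementation or its API. The three-concept ontology, the hand-written cluster table, and the function names (`normalise`, `build_index`, `annotate`) are hypothetical, and matching labels as unordered token sets is an assumption made here for brevity.

```python
import re

# Hypothetical toy ontology: HPO id -> labels (a real run would load the full HPO).
TOY_ONTOLOGY = {
    "HP:0001250": ["Seizure"],
    "HP:0001263": ["Global developmental delay"],
    "HP:0000252": ["Microcephaly"],
}

# Hypothetical hand-written clusters of morphologically equivalent tokens:
# each surface form maps to one cluster representative.
TOY_CLUSTERS = {
    "seizure": "seizure", "seizures": "seizure",
    "delay": "delay", "delays": "delay", "delayed": "delay",
    "development": "development", "developmental": "development",
}

def normalise(token):
    """Map a token to its cluster representative (or to itself, lower-cased)."""
    lowered = token.lower()
    return TOY_CLUSTERS.get(lowered, lowered)

def build_index(ontology):
    """Index every label by its set of normalised tokens (order-insensitive)."""
    index = {}
    for hpo_id, labels in ontology.items():
        for label in labels:
            key = frozenset(normalise(t) for t in re.findall(r"[A-Za-z]+", label))
            index[key] = hpo_id
    return index

def match_run(run, index, text):
    """Greedy longest match inside one run of consecutive ontology tokens."""
    hits, i = [], 0
    while i < len(run):
        for j in range(len(run), i, -1):
            key = frozenset(normalise(tok) for tok, _, _ in run[i:j])
            if key in index:
                hits.append((index[key], text[run[i][1]:run[j - 1][2]]))
                i = j
                break
        else:
            i += 1  # no label starts at this token
    return hits

def annotate(text, index):
    """Only runs of tokens that occur in some ontology label become candidates."""
    vocabulary = set().union(*index)
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"[A-Za-z]+", text)]
    hits, run = [], []
    for tok in tokens + [None]:  # None is a sentinel that closes the last run
        if tok is not None and normalise(tok[0]) in vocabulary:
            run.append(tok)
        else:
            hits.extend(match_run(run, index, text))
            run = []
    return hits

index = build_index(TOY_ONTOLOGY)
print(annotate("The patient had delayed global development and seizures.", index))
```

Because the index is built directly from the ontology labels, refreshing the concepts in this sketch amounts to rebuilding the dictionary, which mirrors the lifecycle argument made in the Motivation.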

Funder

European Union’s Horizon 2020 research and innovation program

Publisher

Oxford University Press (OUP)
