PhenoTagger: a hybrid method for phenotype concept recognition using human phenotype ontology

Published:2021-01-20 Issue:13 Volume:37 Page:1884-1890
ISSN:1367-4803
Container-title:Bioinformatics
language:en

Author:

Luo Ling¹, Yan Shankai¹, Lai Po-Ting¹, Veltri Daniel², Oler Andrew², Xirasagar Sandhya², Ghosh Rajarshi², Similuk Morgan², Robinson Peter N³, Lu Zhiyong¹

Affiliation:

1. National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA

2. Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA

3. The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA

Abstract

Motivation: Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research. Previous works that address the task typically use dictionary-based matching methods, which can achieve high precision but suffer from lower recall. Recently, machine learning-based methods have been proposed to identify biomedical concepts; these can recognize more unseen concept synonyms through automatic feature learning. However, most such methods require large corpora of manually annotated data for model training, which are difficult to obtain due to the high cost of human annotation.

Results: In this article, we propose PhenoTagger, a hybrid method that combines dictionary and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary, which is then used to automatically build a distantly supervised training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase (an n-gram from the input sentence) into a corresponding concept label. Finally, the dictionary and machine learning-based prediction results are combined for improved performance. Our method is validated with two HPO corpora, and the results show that PhenoTagger compares favorably to previous methods. In addition, to demonstrate the generalizability of our method, we retrained PhenoTagger using the disease ontology MEDIC for disease concept recognition to investigate the effect of training on different ontologies. Experimental results on the NCBI disease corpus show that PhenoTagger, without requiring manually annotated training data, achieves performance competitive with state-of-the-art supervised methods.

Availability and implementation: The source code, API information and data for PhenoTagger are freely available at https://github.com/ncbi-nlp/PhenoTagger.

Supplementary information: Supplementary data are available at Bioinformatics online.
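
The hybrid pipeline described in the abstract (n-gram candidates, dictionary lookup, a classifier over candidates, then a merge) can be illustrated with a minimal Python sketch. This is not the PhenoTagger implementation: the toy dictionary, the `toy_scorer` stand-in for the trained deep learning model, the 0.95 threshold, the 3-token n-gram limit, and the merge rule (union of both result sets, with the dictionary label winning on identical spans) are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive

# Toy dictionary (assumption): lower-cased surface form -> HPO identifier.
# A real dictionary would hold every HPO concept name and synonym.
TOY_DICTIONARY: Dict[str, str] = {
    "seizure": "HP:0001250",
    "seizures": "HP:0001250",
    "short stature": "HP:0004322",
}


def ngrams(tokens: List[str], max_len: int = 3) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, phrase) for every n-gram of up to max_len tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield start, end, " ".join(tokens[start:end])


def dictionary_tag(tokens: List[str], dictionary: Dict[str, str]) -> Dict[Span, str]:
    """Exact, case-insensitive dictionary matching over candidate n-grams."""
    hits: Dict[Span, str] = {}
    for start, end, phrase in ngrams(tokens):
        label = dictionary.get(phrase.lower())
        if label is not None:
            hits[(start, end)] = label
    return hits


def model_tag(
    tokens: List[str],
    scorer: Callable[[str], Tuple[Optional[str], float]],
    threshold: float = 0.95,  # hypothetical confidence cut-off
) -> Dict[Span, str]:
    """Classify each candidate n-gram and keep confident predictions."""
    hits: Dict[Span, str] = {}
    for start, end, phrase in ngrams(tokens):
        label, score = scorer(phrase)
        if label is not None and score >= threshold:
            hits[(start, end)] = label
    return hits


def combine(dict_hits: Dict[Span, str], model_hits: Dict[Span, str]) -> Dict[Span, str]:
    """Union of both result sets; dictionary label wins on identical spans."""
    merged = dict(model_hits)
    merged.update(dict_hits)
    return merged


def toy_scorer(phrase: str) -> Tuple[Optional[str], float]:
    """Stand-in for a trained classifier (assumption): knows one unseen synonym."""
    unseen = {"very short height": ("HP:0004322", 0.97)}
    return unseen.get(phrase.lower(), (None, 0.0))


tokens = "The patient had seizures and very short height".split()
result = combine(dictionary_tag(tokens, TOY_DICTIONARY), model_tag(tokens, toy_scorer))
print(result)
```

In this toy run the dictionary finds "seizures", while the stand-in classifier contributes "very short height", a phrasing absent from the dictionary. That mirrors the abstract's point that the dictionary brings precision and the learned model adds recall on unseen synonyms.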

Funder

Intramural Research Programs of the National Institutes of Health, National Library of Medicine

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics,Computational Theory and Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Statistics and Probability

Link

http://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btab019/36158831/btab019.pdf

Reference: 32 articles.

1. Identifying clinical terms in medical text using Ontology-Guided machine learning;Arbabi;JMIR Med. Inf,2019

2. Concept recognition for extracting protein interaction relations from biomedical text;Baumgartner;Genome Biol,2008

3. Random search for hyper-parameter optimization;Bergstra;J. Mach. Learn. Res,2012

Cited by 28 articles.

1. Exploring the reversal curse and other deductive logical reasoning in BERT and GPT-based large language models;Patterns;2024-09

2. RareBench: Can LLMs Serve as Rare Diseases Specialists?;Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24

3. Towards automated phenotype definition extraction using large language models;2024-08-21

4. Online Mendelian Inheritance in Animals (OMIA): a genetic resource for vertebrate animals;Mammalian Genome;2024-08-14

5. Addressing diagnostic gaps and priorities of the global rare diseases community: Recommendations from the IRDiRC diagnostics scientific committee;European Journal of Medical Genetics;2024-08