Author:
Dong Hang,Suárez-Paniagua Víctor,Zhang Huayu,Wang Minhong,Casey Arlene,Davidson Emma,Chen Jiaoyan,Alex Beatrice,Whiteley William,Wu Honghan
Abstract
Abstract
Background
Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to be identified due to few cases available for machine learning and the need for data annotation from domain experts.
Methods
We propose a method using ontologies and weak supervision, with recent pre-trained contextual representations from Bi-directional Transformers (e.g. BERT). The ontology-driven framework includes two steps: (i) Text-to-UMLS, extracting phenotypes by contextually linking mentions to concepts in Unified Medical Language System (UMLS), with a Named Entity Recognition and Linking (NER+L) tool, SemEHR, and weak supervision with customised rules and contextual mention representation; (ii) UMLS-to-ORDO, matching UMLS concepts to rare diseases in Orphanet Rare Disease Ontology (ORDO). The weakly supervised approach is proposed to learn a phenotype confirmation model to improve Text-to-UMLS linking, without annotated data from domain experts. We evaluated the approach on three clinical datasets, MIMIC-III discharge summaries, MIMIC-III radiology reports, and NHS Tayside brain imaging reports from two institutions in the US and the UK, with annotations.
Results
The improvements in the precision were pronounced (by over 30% to 50% absolute score for Text-to-UMLS linking), with almost no loss of recall compared to the existing NER+L tool, SemEHR. Results on radiology reports from MIMIC-III and NHS Tayside were consistent with the discharge summaries. The overall pipeline processing clinical notes can extract rare disease cases, mostly uncaptured in structured data (manually assigned ICD codes).
Conclusion
The study provides empirical evidence for the task by applying a weakly supervised NLP pipeline on clinical notes. The proposed weak supervised deep learning approach requires no human annotation except for validation and testing, by leveraging ontologies, NER+L tools, and contextual representations. The study also demonstrates that Natural Language Processing (NLP) can complement traditional ICD-based approaches to better estimate rare diseases in clinical notes. We discuss the usefulness and limitations of the weak supervision approach and propose directions for future studies.
Funder
Health Data Research UK
Wellcome Trust
Engineering and Physical Sciences Research Council
Advanced Care Research Centre
Publisher
Springer Science and Business Media LLC
Subject
Health Informatics,Health Policy,Computer Science Applications
Reference54 articles.
1. Nguengang Wakap S, Lambert DM, Olry A, Rodwell C, Gueydan C, Lanneau V, et al. Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database. Eur J Hum Genet. 2020;28(2):165–73.
2. Department of Health & Social Care. The UK Rare Diseases Framework. 2021. https://www.gov.uk/government/publications/uk-rare-diseases-framework/the-uk-rare-diseases-framework. Accessed 8 May 2022.
3. Scottish Government. Illnesses and long-term conditions. 2021. https://www.gov.scot/policies/illnesses-and-long-term-conditions/rare-diseases/. Accessed 22 Mar 2021.
4. Richesson RL, Fung KW, Bodenreider O. Coverage of Rare Disease Names in Clinical Coding Systems and Ontologies and Implications for Electronic Health Records-Based Research. In: Proceedings of the 5th International Conference on Biomedical Ontology. Houston: CEUR Workshop Proceedings (CEUR-WS.org); 2014. p. 78–80.
5. Bearryman E. Does your rare disease have a code? 2016. https://www.eurordis.org/news/does-your-rare-disease-have-code. Accessed 29 July 2021.
Cited by
9 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献