Large Language Models Facilitate the Generation of Electronic Health Record Phenotyping Algorithms

Published: 2023-12-19

Author:

Yan Chao, Ong Henry H., Grabowska Monika E., Krantz Matthew S., Su Wu-Chen, Dickson Alyson L., Peterson Josh F., Feng QiPing, Roden Dan M., Stein C. Michael, Kerchberger V. Eric, Malin Brad A., Wei Wei-Qi

Abstract

Objectives: Phenotyping is a core task in observational health research utilizing electronic health records (EHRs). Developing an accurate algorithm typically demands substantial input from domain experts, involving extensive literature review and evidence synthesis. This burdensome process limits scalability and delays knowledge discovery. We investigate the potential for leveraging large language models (LLMs) to enhance the efficiency of EHR phenotyping by generating drafts of high-quality algorithms.

Materials and Methods: We prompted four LLMs (ChatGPT-4, ChatGPT-3.5, Claude 2, and Bard) in October 2023, asking them to generate executable phenotyping algorithms in the form of SQL queries adhering to a common data model for three clinical phenotypes (i.e., type 2 diabetes mellitus, dementia, and hypothyroidism). Three phenotyping experts evaluated the returned algorithms across several critical metrics. We further implemented the top-rated algorithms from each LLM and compared them against clinician-validated phenotyping algorithms from the Electronic Medical Records and Genomics (eMERGE) network.

Results: ChatGPT-4 and ChatGPT-3.5 exhibited significantly higher overall expert evaluation scores in instruction following, algorithmic logic, and SQL executability, when compared to Claude 2 and Bard. Although ChatGPT-4 and ChatGPT-3.5 effectively identified relevant clinical concepts, they exhibited immature capability in organizing phenotyping criteria with the proper logic, leading to phenotyping algorithms that were either excessively restrictive (with low recall) or overly broad (with low positive predictive values).

Conclusion: Both ChatGPT versions 3.5 and 4 demonstrate the capability to enhance EHR phenotyping efficiency by drafting algorithms of reasonable quality. However, the optimal performance of these algorithms necessitates the involvement of domain experts.
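The trade-off named in the Results (excessively restrictive algorithms with low recall versus overly broad ones with low positive predictive value) can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `recall_and_ppv` and all patient IDs are hypothetical, and it assumes only that a candidate algorithm and a reference standard each yield a set of patient identifiers.

```python
def recall_and_ppv(predicted_cases, reference_cases):
    """Return (recall, PPV) of a candidate cohort against a reference cohort.

    Both arguments are iterables of patient identifiers. NaN is returned
    for a metric whose denominator would be zero.
    """
    predicted = set(predicted_cases)
    reference = set(reference_cases)
    true_positives = len(predicted & reference)
    # Recall: share of reference cases that the candidate algorithm found.
    recall = true_positives / len(reference) if reference else float("nan")
    # PPV: share of the candidate's cases that are reference cases.
    ppv = true_positives / len(predicted) if predicted else float("nan")
    return recall, ppv


# Hypothetical patient IDs, made up purely for illustration.
reference = set(range(1, 9))   # 8 patients flagged by the reference standard
restrictive = {1, 2}           # finds few cases, all of them correct
broad = set(range(1, 21))      # finds every case plus 12 false positives

print(recall_and_ppv(restrictive, reference))  # (0.25, 1.0)
print(recall_and_ppv(broad, reference))        # (1.0, 0.4)
```

In this toy data the restrictive cohort has perfect PPV but misses most reference cases, while the broad cohort captures every reference case at the cost of many false positives.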

Publisher

Cold Spring Harbor Laboratory
