Evaluating ChatGPT in Information Extraction: A Case Study of Extracting Cognitive Exam Dates and Scores

Authors:

Neil Jethani, Simon Jones, Nicholas Genes, Vincent J. Major, Ian S. Jaffe, Anthony B. Cardillo, Noah Heilenbach, Nadia Fazal Ali, Luke J. Bonanni, Andrew J. Clayburn, Zain Khera, Erica C. Sadler, Jaideep Prasad, Jamie Schlacter, Kevin Liu, Benjamin Silva, Sophie Montgomery, Eric J. Kim, Jacob Lester, Theodore M. Hill, Alba Avoricani, Ethan Chervonski, James Davydov, William Small, Eesha Chakravartty, Himanshu Grover, John A. Dodson, Abraham A. Brody, Yindalon Aphinyanaphongs, Narges Razavian

Abstract

Background

Large language models (LLMs) provide powerful natural language processing (NLP) capabilities for medical and clinical tasks. Evaluating LLM performance is crucial because these models can produce false results. In this study, we assessed ChatGPT, a state-of-the-art LLM, on extracting information from clinical notes, focusing on two cognitive tests: the Mini-Mental State Exam (MMSE) and the Clinical Dementia Rating (CDR). We tasked ChatGPT with extracting MMSE and CDR scores and their corresponding dates from clinical notes.

Methods

Our cohort comprised 135,307 clinical notes (January 12, 2010 to May 24, 2023) mentioning the MMSE, CDR, or Montreal Cognitive Assessment (MoCA). After applying inclusion criteria and excluding notes that mentioned only the MoCA, 34,465 notes remained; 765 of these were randomly selected for analysis. ChatGPT (GPT-4, version "2023-03-15-preview") was used on the 765 notes to extract MMSE and CDR instances with their corresponding dates, and inference was successful for 742 notes. Twenty notes were used for fine-tuning and for training the reviewers; the remaining 722 were assigned to 22 medically trained expert reviewers, who reviewed ChatGPT's responses and provided the ground truth, with 309 notes assigned to two reviewers simultaneously. We calculated inter-rater agreement (Fleiss' kappa), precision, recall, true/false-negative rates, and accuracy.

Results

For MMSE information extraction, ChatGPT achieved 83% accuracy, high sensitivity (macro recall of 89.7%), a true-negative rate of 96%, and a precision of 82.7%. For CDR information extraction, ChatGPT achieved 89% accuracy, a macro recall of 91.3%, and a perfect true-negative rate of 100%; precision, however, was lower at 57%. In the ground-truth data, 89.1% of the notes included MMSE documentation, whereas only 14.3% included CDR documentation, which affected the precision of CDR extraction. Inter-rater agreement was substantial, supporting the validity of our findings. Reviewers considered ChatGPT's responses correct (96% for MMSE, 98% for CDR) and complete (84% for MMSE, 83% for CDR).

Conclusion

ChatGPT demonstrates overall accuracy in extracting MMSE and CDR scores and dates, potentially benefiting dementia research and clinical care. The prior probability of the information appearing in the text affected ChatGPT's precision. Rigorous evaluation of LLMs on diverse medical tasks is crucial to understanding their capabilities and limitations.
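The extraction step described in the Methods can be illustrated concretely. Below is a minimal sketch of sending one clinical note to a GPT-4 deployment through the Azure OpenAI API, using the legacy openai (<1.0) Python SDK. Only the API version string comes from the abstract; the endpoint, deployment name, prompt wording, and output format are illustrative assumptions, not the study's actual configuration.

```python
import json

import openai  # legacy openai<1.0 SDK interface

# Azure OpenAI configuration; the endpoint and key are placeholders.
openai.api_type = "azure"
openai.api_base = "https://example-resource.openai.azure.com/"  # hypothetical endpoint
openai.api_version = "2023-03-15-preview"  # GPT-4 API version named in the abstract
openai.api_key = "YOUR-API-KEY"

# Hypothetical extraction prompt; the study's actual prompt is not given
# in the abstract.
SYSTEM_PROMPT = (
    "From the clinical note below, extract every MMSE and CDR result. "
    "Return a JSON list of objects with keys 'test' ('MMSE' or 'CDR'), "
    "'score', and 'date' (the date the exam was administered, or null if "
    "none is documented). Return [] if neither test is mentioned."
)

def extract_cognitive_scores(note_text: str) -> list:
    """Ask GPT-4 to pull MMSE/CDR scores and exam dates out of one note."""
    response = openai.ChatCompletion.create(
        engine="gpt-4",  # hypothetical Azure deployment name
        temperature=0,   # deterministic decoding for extraction
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": note_text},
        ],
    )
    # May raise ValueError if the model strays from JSON; a real pipeline
    # needs retry/validation logic here (inference failed for 23 of the
    # 765 notes in the study).
    return json.loads(response["choices"][0]["message"]["content"])

print(extract_cognitive_scores(
    "01/12/2010: MMSE administered today, score 24/30. CDR 0.5."
))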
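The reported metrics are likewise standard and reproducible with off-the-shelf libraries. A minimal sketch, assuming a simplified binary per-note labeling (1 = the note contains the score in question, 0 = it does not); the abstract does not specify the study's exact label scheme, so the arrays below are toy data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # expert-adjudicated ground truth (toy)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # ChatGPT's output, judged per note (toy)

print("accuracy:          ", accuracy_score(y_true, y_pred))
print("precision:         ", precision_score(y_true, y_pred))
print("macro recall:      ", recall_score(y_true, y_pred, average="macro"))

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("true-negative rate:", tn / (tn + fp))

# Fleiss' kappa over the notes assigned to two reviewers simultaneously:
# rows are notes, columns are raters, entries are each rater's category.
ratings = np.array([[1, 1], [1, 0], [0, 0], [1, 1]])
table, _ = aggregate_raters(ratings)
print("Fleiss' kappa:     ", fleiss_kappa(table))
```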

Publisher

Cold Spring Harbor Laboratory
