Evaluating ChatGPT in Information Extraction: A Case Study of Extracting Cognitive Exam Dates and Scores

Authors:

Neil Jethani, Simon Jones, Nicholas Genes, Vincent J. Major, Ian S. Jaffe, Anthony B. Cardillo, Noah Heilenbach, Nadia Fazal Ali, Luke J. Bonanni, Andrew J. Clayburn, Zain Khera, Erica C. Sadler, Jaideep Prasad, Jamie Schlacter, Kevin Liu, Benjamin Silva, Sophie Montgomery, Eric J. Kim, Jacob Lester, Theodore M. Hill, Alba Avoricani, Ethan Chervonski, James Davydov, William Small, Eesha Chakravartty, Himanshu Grover, John A. Dodson, Abraham A. Brody, Yindalon Aphinyanaphongs, Narges Razavian

Abstract

Background

Large language models (LLMs) provide powerful natural language processing (NLP) capabilities for medical and clinical tasks. Evaluating LLM performance is crucial because these models can produce false results. In this study, we assessed ChatGPT, a state-of-the-art LLM, on extracting information from clinical notes, focusing on two cognitive tests: the Mini-Mental State Exam (MMSE) and the Clinical Dementia Rating (CDR). We tasked ChatGPT with extracting MMSE and CDR scores and their corresponding dates from clinical notes.

Methods

Our cohort comprised 135,307 clinical notes (January 12, 2010 to May 24, 2023) mentioning the MMSE, CDR, or Montreal Cognitive Assessment (MoCA). After applying inclusion criteria and excluding notes that mentioned only the MoCA, 34,465 notes remained; 765 of these were randomly selected for analysis. ChatGPT (GPT-4, version "2023-03-15-preview") was used on the 765 notes to extract MMSE and CDR instances with their corresponding dates, and inference was successful for 742 notes. Twenty notes were used for fine-tuning and for training the reviewers; the remaining 722 were assigned to 22 medically trained expert reviewers, who reviewed ChatGPT's responses and provided the ground truth, with 309 notes assigned to two reviewers simultaneously. We calculated inter-rater agreement (Fleiss' kappa), precision, recall, true/false-negative rates, and accuracy.

Results

For MMSE information extraction, ChatGPT achieved 83% accuracy, high sensitivity (macro recall of 89.7%), a true-negative rate of 96%, and a precision of 82.7%. For CDR information extraction, ChatGPT achieved 89% accuracy, a macro recall of 91.3%, and a perfect true-negative rate of 100%; precision, however, was lower at 57%. In the ground-truth data, 89.1% of the notes included MMSE documentation, whereas only 14.3% included CDR documentation, which affected the precision of CDR extraction. Inter-rater agreement was substantial, supporting the validity of our findings. Reviewers considered ChatGPT's responses correct (96% for MMSE, 98% for CDR) and complete (84% for MMSE, 83% for CDR).

Conclusion

ChatGPT demonstrates overall accuracy in extracting MMSE and CDR scores and dates, potentially benefiting dementia research and clinical care. The prior probability of the information appearing in the text affected ChatGPT's precision. Rigorous evaluation of LLMs on diverse medical tasks is crucial to understanding their capabilities and limitations.
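The extraction step described in the Methods can be illustrated concretely. Below is a minimal sketch of sending one clinical note to a GPT-4 deployment through the Azure OpenAI API, using the legacy openai (<1.0) Python SDK. Only the API version string comes from the abstract; the endpoint, deployment name, prompt wording, and output format are illustrative assumptions, not the study's actual configuration.

```python
import json

import openai  # legacy openai<1.0 SDK interface

# Azure OpenAI configuration; the endpoint and key are placeholders.
openai.api_type = "azure"
openai.api_base = "https://example-resource.openai.azure.com/"  # hypothetical endpoint
openai.api_version = "2023-03-15-preview"  # GPT-4 API version named in the abstract
openai.api_key = "YOUR-API-KEY"

# Hypothetical extraction prompt; the study's actual prompt is not given
# in the abstract.
SYSTEM_PROMPT = (
    "From the clinical note below, extract every MMSE and CDR result. "
    "Return a JSON list of objects with keys 'test' ('MMSE' or 'CDR'), "
    "'score', and 'date' (the date the exam was administered, or null if "
    "none is documented). Return [] if neither test is mentioned."
)

def extract_cognitive_scores(note_text: str) -> list:
    """Ask GPT-4 to pull MMSE/CDR scores and exam dates out of one note."""
    response = openai.ChatCompletion.create(
        engine="gpt-4",  # hypothetical Azure deployment name
        temperature=0,   # deterministic decoding for extraction
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": note_text},
        ],
    )
    # May raise ValueError if the model strays from JSON; a real pipeline
    # needs retry/validation logic here (inference failed for 23 of the
    # 765 notes in the study).
    return json.loads(response["choices"][0]["message"]["content"])

print(extract_cognitive_scores(
    "01/12/2010: MMSE administered today, score 24/30. CDR 0.5."
))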
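The reported metrics are likewise standard and reproducible with off-the-shelf libraries. A minimal sketch, assuming a simplified binary per-note labeling (1 = the note contains the score in question, 0 = it does not); the abstract does not specify the study's exact label scheme, so the arrays below are toy data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # expert-adjudicated ground truth (toy)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # ChatGPT's output, judged per note (toy)

print("accuracy:          ", accuracy_score(y_true, y_pred))
print("precision:         ", precision_score(y_true, y_pred))
print("macro recall:      ", recall_score(y_true, y_pred, average="macro"))

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("true-negative rate:", tn / (tn + fp))

# Fleiss' kappa over the notes assigned to two reviewers simultaneously:
# rows are notes, columns are raters, entries are each rater's category.
ratings = np.array([[1, 1], [1, 0], [0, 0], [1, 1]])
table, _ = aggregate_raters(ratings)
print("Fleiss' kappa:     ", fleiss_kappa(table))
```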

Publisher

Cold Spring Harbor Laboratory
