Evaluating the Appropriateness, Consistency, and Readability of ChatGPT in Critical Care Recommendations

Published:2024-08-08
ISSN:0885-0666
Container-title:Journal of Intensive Care Medicine
language:en
Short-container-title:J Intensive Care Med

Author:

Balta Kaan Y.¹, Javidan Arshia P.², Walser Eric³⁴, Arntfield Robert³, Prager Ross³

Affiliation:

1. Schulich School of Medicine & Dentistry, Western University, London, Ontario, Canada

2. Division of Vascular Surgery, Department of Surgery, University of Toronto, Toronto, Ontario, Canada

3. Division of Critical Care, London Health Sciences Centre, Western University, London, Ontario, Canada

4. Department of Surgery, Trauma Program, London Health Sciences Centre, London, Ontario, Canada

Abstract

Background: We assessed 2 versions of the large language model (LLM) ChatGPT, versions 3.5 and 4.0, in generating appropriate, consistent, and readable recommendations on core critical care topics.

Research Question: How do successive large language models compare in terms of generating appropriate, consistent, and readable recommendations on core critical care topics?

Design and Methods: A set of 50 LLM-generated responses to clinical questions was evaluated by 2 independent intensivists on a 5-point Likert scale for appropriateness, consistency, and readability.

Results: ChatGPT 4.0 showed significantly higher median appropriateness scores than ChatGPT 3.5 (4.0 vs 3.0, P < .001). However, there was no significant difference in consistency between the 2 versions (40% vs 28%, P = .291). Readability, assessed by the Flesch-Kincaid Grade Level, was also not significantly different between the 2 models (14.3 vs 14.4, P = .93).

Interpretation: Both models produced "hallucinations" (misinformation delivered with high confidence), which highlights the risk of relying on these tools without domain expertise. Despite their potential for clinical application, both models lacked consistency, producing different results when asked the same question multiple times. The study underscores the need for clinicians to understand the strengths and limitations of LLMs for safe and effective implementation in critical care settings.

Registration: https://osf.io/8chj7/
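Note on the readability metric: the Flesch-Kincaid Grade Level named in the abstract is conventionally computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The following is a minimal sketch of that formula, not the authors' tooling, which the abstract does not describe. The function names, the regex-based sentence and word splitting, and the vowel-group syllable counter are all illustrative assumptions; published tools use more careful syllable counting, so scores will differ somewhat.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (assumption): count runs of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing 'e' as silent (e.g. "made"), except in "-le" endings (e.g. "table").
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```

A score near 14, as reported for both models, corresponds roughly to text pitched at a reader with some university-level education.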

Publisher

SAGE Publications

Link

https://journals.sagepub.com/doi/pdf/10.1177/08850666241267871

References (23 articles):

1. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models

2. Revolutionizing radiology with GPT-based models: Current applications, future possibilities and limitations of ChatGPT

3. Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot

4. Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers