ChatGPT-4 Knows Its A B C D E but Cannot Cite Its Source

Published:2024-07 Issue:3 Volume:9
ISSN:2472-7245
Container-title:JBJS Open Access
language:en

Author:

Ghanem Diane¹, Zhu Alexander R.², Kagabo Whitney¹, Osgood Greg¹, Shafiq Babar¹

Affiliation:

1. Department of Orthopaedic Surgery, The Johns Hopkins Hospital, Baltimore, Maryland

2. School of Medicine, The Johns Hopkins University, Baltimore, Maryland

Abstract

Introduction: The artificial intelligence language model Chat Generative Pretrained Transformer (ChatGPT) has shown potential as a reliable and accessible educational resource in orthopaedic surgery. Yet the accuracy of the references behind the information it provides remains uncertain, which raises concerns about the integrity of medical content. This study examines the accuracy of the references provided by ChatGPT-4 concerning the Airway, Breathing, Circulation, Disability, Exposure (ABCDE) approach in trauma surgery.

Methods: Two independent reviewers critically assessed 30 ChatGPT-4–generated references supporting the well-established ABCDE approach to trauma protocol, grading each as 0 (nonexistent), 1 (inaccurate), or 2 (accurate). All discrepancies between the ChatGPT-4 and PubMed references were carefully reviewed and bolded. Cohen's kappa coefficient was used to assess agreement between the reviewers on the accuracy scores of the ChatGPT-4–generated references. Descriptive statistics were used to summarize the mean reference accuracy scores. One-way analysis of variance was used to compare the mean scores across the 5 categories.

Results: ChatGPT-4 had an average reference accuracy score of 66.7%. Of the 30 references, only 43.3% were accurate and deemed “true”, while 56.7% were categorized as “false” (43.3% inaccurate and 13.3% nonexistent). Accuracy was consistent across the 5 trauma protocol categories, with no statistically significant difference (p = 0.437).

Discussion: With 57% of references being inaccurate or nonexistent, ChatGPT-4 fell short in providing reliable and reproducible references, a concerning finding for the safety of using ChatGPT-4 for professional medical decision making without thorough verification. Only if used cautiously, with cross-referencing, can this language model act as an adjunct learning tool that can enhance comprehensiveness as well as knowledge rehearsal and manipulation.
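The 0/1/2 grading and the inter-reviewer agreement check described in the Methods can be illustrated with a minimal Python sketch. The scores below are hypothetical values invented for illustration, not the study's data, and the function names `cohen_kappa` and `percent_score` are assumptions of this sketch rather than anything the authors describe.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two raters scoring the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(a)
    # Observed agreement: share of items given the same score by both raters.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over categories.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    if p_e == 1:
        return 1.0  # both raters used a single identical category throughout
    return (p_o - p_e) / (1 - p_e)

def percent_score(scores, max_score=2):
    """Mean score expressed as a percentage of the maximum possible score."""
    return 100 * sum(scores) / (max_score * len(scores))

# Hypothetical scores for 10 references (0 = nonexistent, 1 = inaccurate, 2 = accurate).
reviewer_a = [2, 2, 1, 0, 2, 1, 1, 2, 0, 1]
reviewer_b = [2, 2, 1, 0, 1, 1, 1, 2, 0, 2]

print(f"kappa = {cohen_kappa(reviewer_a, reviewer_b):.2f}")        # kappa = 0.69
print(f"reviewer A accuracy = {percent_score(reviewer_a):.1f}%")  # reviewer A accuracy = 60.0%
```

In this made-up example the reviewers agree on 8 of 10 references (observed agreement 0.80) against a chance agreement of 0.36, which gives a kappa of about 0.69. The abstract does not say how the two reviewers' scores were combined into the reported average, so the sketch scores a single reviewer only.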

Publisher

Ovid Technologies (Wolters Kluwer Health)
