BACKGROUND
LLMs such as GPT-4 show promise in medical consultations but face challenges in non-English and real-time contexts. The newer GPT-4o, with improved text processing and faster responses, may better support consultations for rare diseases such as retroperitoneal fibrosis (RPF).
OBJECTIVE
The performance of GPT-4o in providing real-time medical consultations for patients with rare diseases, which remain a general challenge in clinical practice, is underexplored. We evaluated the competency of GPT-4o in generating responses to questions about autoimmune RPF, a rare disease, rating accuracy, completeness, readability, and quality on a 7-point Likert scale.
METHODS
A total of 103 real-world queries from RPF patients were collected from diverse sources. Responses were generated using the newly released version of GPT-4o (2024/5/17). The questions were stratified and randomly divided into six groups. Each of six attending rheumatologists was assigned one group of questions to answer independently and then generated revised responses with the assistance of GPT-4o. All responses were assessed blindly by three experts in RPF.
RESULTS
GPT-4o scored significantly higher than rheumatologists in accuracy (6.39 ± 0.50 vs. 4.99 ± 0.62), completeness (6.51 ± 0.44 vs. 4.55 ± 0.60), readability (6.45 ± 0.42 vs. 4.93 ± 0.59), and quality (6.42 ± 0.46 vs. 4.78 ± 0.55) (all p < 0.001). The competency of rheumatologists + GPT-4o was better than that of rheumatologists alone (accuracy: 6.13 ± 0.63; completeness: 5.99 ± 0.81; readability: 6.05 ± 0.67; quality: 6.01 ± 0.71; all p < 0.001), but physician revisions generally reduced the competency of GPT-4o. Subgroup analysis showed no significant difference in accuracy between GPT-4o and rheumatologists + GPT-4o when answering complex questions, whereas any type of revision lowered the competency of GPT-4o.
CONCLUSIONS
GPT-4o has the potential to provide real-time medical consultations for RPF patients in the Chinese clinical environment.