Application of Large Language Models in Medical Training Evaluation: Can ChatGPT Be a Standardized Patient? An Exploratory Study (Preprint)

Authors:

Wang Chenxu, Li Shuhan, Lin Nuoxi, Zhang Xinyu, Han Ying, Wang Xiandi, Liu Di, Tan Xiaomei, Pu Dan, Li Kang, Qian Guangwu, Yin Rong

Abstract

BACKGROUND

Despite growing interest in applying Large Language Models (LLMs) in the medical field, their potential to serve as Standardized Patients (SPs) in medical assessment has rarely been evaluated. We therefore examined whether ChatGPT, a representative LLM, could transform medical education by serving as a cost-effective alternative to human SPs, specifically for history-taking tasks.

OBJECTIVE

This study aims to assess ChatGPT's viability and performance as an SP, using prompt engineering to refine its accuracy and utility in medical assessments.

METHODS

A two-phase experiment was designed to assess ChatGPT's viability as an SP in medical education. The first phase tested feasibility by simulating conversations about Inflammatory Bowel Disease (IBD), categorizing responses into poor, medium, and good inquiries based on relevance and accuracy. In the second phase, a more structured experiment used detailed scripts to evaluate ChatGPT's performance against specific criteria, focusing on its anthropomorphism, clinical accuracy, and adaptability. Prompts were adjusted based on shortcomings in ChatGPT's responses, and performance under the original and revised prompts was compared to track improvements. The methodology included statistical analysis to ensure rigorous evaluation, with data collected between November and December 2023.
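For illustration, the kind of simulated history-taking exchange used in phase one can be reproduced with a short script. The sketch below is a minimal, assumption-laden example rather than the study's actual setup: the OpenAI Python SDK, the model name, and the system-prompt wording are all illustrative stand-ins, written in the spirit of the revised prompts described above (lay language; brief, clinically accurate answers).

```python
# Minimal sketch of a phase-one-style SP simulation. Assumptions (not from
# the study): the OpenAI Python SDK, the model name, and the system-prompt
# wording are illustrative stand-ins for the study's actual scripts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a standardized patient with inflammatory bowel disease. "
    "Answer only what the student asks, in plain lay language without "
    "medical jargon, and keep each answer brief and clinically accurate."
)

def sp_reply(history: list[dict], student_question: str) -> str:
    """Send one history-taking question and return the simulated patient's answer."""
    history.append({"role": "user", "content": student_question})
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; the study ran ChatGPT in Nov-Dec 2023
        messages=[{"role": "system", "content": SYSTEM_PROMPT}] + history,
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

dialogue: list[dict] = []
print(sp_reply(dialogue, "When did your abdominal pain start?"))
```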

RESULTS

The feasibility test (90 runs) confirmed that ChatGPT can effectively simulate an SP, differentiating between poor, medium, and good medical inquiries with varying degrees of accuracy. At a significance level of α = .05, score differences were significant between the poor (74.7, SD=5.44) and medium (82.67, SD=5.30) inquiry groups (P < .001) and between the poor and good (85, SD=3.27) inquiry groups (P < .001), but not between the medium and good inquiry groups (P = .158). However, performance was not ideal without proper prompt restrictions. The revised prompts instructed ChatGPT to avoid medical jargon (for realism), to provide accurate and concise responses (for clinical accuracy), and to follow specific grading instructions (for grading accuracy and adaptability). The second experimental phase comprised 300 trials. The revised prompt significantly improved ChatGPT's realism, clinical accuracy, and adaptability, markedly reducing scoring discrepancies: scoring accuracy improved 4.926-fold over the unrevised prompt, with the score difference percentage (SDP) dropping from 29.83% to 6.06% and its standard deviation from 0.55 to 0.068.
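As a sanity check on these figures, the sketch below computes SDP under an assumed definition (the absolute gap between ChatGPT's score and a human reference score, as a percentage of the reference); the abstract does not state the formula, and the individual scores used here are invented for illustration, not study data.

```python
# Hypothetical illustration of the score difference percentage (SDP).
# Assumption: SDP = |model score - reference score| / reference score * 100;
# the abstract does not spell the formula out.
def sdp(model_score: float, reference_score: float) -> float:
    """Absolute scoring gap as a percentage of the reference score."""
    return abs(model_score - reference_score) / reference_score * 100

# Invented example scores (not study data): a human reference grade of 80,
# with ChatGPT grading 56 under the original prompt and 75 under the revised one.
print(sdp(56, 80))   # 30.0  -- comparable to the reported 29.83% before revision
print(sdp(75, 80))   # 6.25  -- comparable to the reported 6.06% after revision

# The reported means imply roughly a 29.83 / 6.06 ~ 4.92-fold gain,
# consistent with the abstract's reported 4.926-fold improvement.
print(29.83 / 6.06)
```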

CONCLUSIONS

ChatGPT, as a representative LLM, is a viable tool for simulating SPs in medical assessments, with the potential to enhance medical training. Incorporating detailed, targeted prompts significantly improves its scoring accuracy and response realism, approaching the level required for actual clinical use. However, despite these promising outcomes, continuous refinement is essential to fully establish the reliability of LLMs such as ChatGPT in clinical assessment settings.

Publisher

JMIR Publications Inc.
