BACKGROUND
Despite growing interest in applying Large Language Models (LLMs) to medicine, their feasibility as Standardized Patients (SPs) in medical assessment has rarely been evaluated. We therefore examined the potential of ChatGPT, a representative LLM, to transform medical education by serving as a cost-effective alternative to SPs in history-taking tasks.
OBJECTIVE
This study aimed to evaluate ChatGPT's viability and performance as an SP, using prompt engineering to refine its accuracy and utility in medical assessments.
METHODS
A two-phase experiment was designed to assess ChatGPT's viability as an SP in medical education. The first phase tested feasibility by simulating history-taking conversations on inflammatory bowel disease (IBD), with inquiries categorized as poor, medium, or good according to their relevance and accuracy. In the second phase, a more structured experiment used detailed scripts to evaluate ChatGPT's performance against specific criteria, focusing on anthropomorphism, clinical accuracy, and adaptability. Prompts were adjusted to address shortcomings in ChatGPT's responses, and performance under the original and revised prompts was compared to track improvements. Statistical analysis was applied to ensure rigorous evaluation; data were collected between November and December 2023.
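As a purely illustrative sketch (not the study's actual scripts or prompt wording), the following shows how an SP simulation of this kind could be set up with the OpenAI Python client; the model name, system prompt text, and case details are assumptions.

```python
# Illustrative sketch only: the prompt wording, model choice, and case details
# are assumptions, not the study's actual scripts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A constrained "standardized patient" system prompt for an IBD history-taking case.
SP_PROMPT = (
    "You are role-playing a standardized patient with inflammatory bowel disease. "
    "Answer only what the student asks, in plain everyday language without medical "
    "jargon, keep replies brief, and never volunteer the diagnosis."
)

def ask_sp(history, student_question):
    """Send one student inquiry to the simulated patient and return its reply."""
    messages = (
        [{"role": "system", "content": SP_PROMPT}]
        + history
        + [{"role": "user", "content": student_question}]
    )
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content

# Example turn: a "good" inquiry about symptom onset.
print(ask_sp([], "When did your abdominal pain and diarrhea first start?"))
```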
RESULTS
The feasibility test confirmed that ChatGPT can effectively simulate an SP, differentiating between poor, medium, and good medical inquiries with varying degrees of accuracy. At a significance level of α=.05, score differences were significant between the poor (mean 74.7, SD 5.44) and medium (mean 82.67, SD 5.30) inquiry groups (P<.001) and between the poor and good (mean 85, SD 3.27) inquiry groups (P<.001), but not between the medium and good groups (P=.158). The feasibility test comprised 90 runs. However, performance was not ideal without appropriate prompt restrictions. The revised prompts instructed ChatGPT to avoid medical jargon for realism, to give accurate and concise responses for clinical accuracy, and to follow specific grading instructions to improve scoring accuracy and adaptability. The second experimental phase comprised 300 trials in total. The revised prompt significantly improved ChatGPT's realism, clinical accuracy, and adaptability, markedly reducing scoring discrepancies: scoring accuracy improved 4.926-fold compared with the unrevised prompt, and the score difference percentage (SDP) dropped from 29.83% to 6.06%, with the standard deviation falling from 0.55 to 0.068.
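For readers who want to reproduce this style of group comparison, a minimal sketch is shown below; the scores are synthetic stand-ins generated from the reported means and SDs, and the use of Welch's t-test is an assumption, as the abstract does not name the exact test.

```python
# Minimal sketch of the pairwise score comparison; the data are synthetic
# stand-ins and Welch's t-test is an assumed choice of test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic scores mimicking the reported group means/SDs (30 runs per group, 90 total).
poor   = rng.normal(74.7, 5.44, 30)
medium = rng.normal(82.67, 5.30, 30)
good   = rng.normal(85.0, 3.27, 30)

for name, a, b in [("poor vs medium", poor, medium),
                   ("poor vs good", poor, good),
                   ("medium vs good", medium, good)]:
    t, p = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
    print(f"{name}: t = {t:.2f}, P = {p:.3f}")
```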
CONCLUSIONS
ChatGPT, as a representative LLM, is a viable tool for simulating SPs in medical assessments and has the potential to enhance medical training. With detailed, targeted prompts, its scoring accuracy and response realism improve significantly, approaching the level required for actual clinical use. Despite these promising outcomes, however, continuous refinement is essential to fully establish the reliability of LLMs such as ChatGPT in clinical assessment settings.