Abstract
The release of GPT-3.5-turbo (ChatGPT) and other large language models (LLMs) has the potential to transform healthcare. However, existing research evaluating LLM performance on real-world clinical notes is limited. Here, we conduct a highly powered study to determine whether GPT-3.5-turbo can provide clinical recommendations for three tasks (admission status, radiological investigation(s) request status, and antibiotic prescription status) using clinical notes from the Emergency Department. We randomly select 10,000 Emergency Department visits to evaluate the accuracy of zero-shot, GPT-3.5-turbo-generated clinical recommendations across four different prompting strategies. We find that GPT-3.5-turbo performs poorly compared to a resident physician, with accuracy scores 24% lower on average. GPT-3.5-turbo tends to be overly cautious in its recommendations, with high sensitivity at the cost of specificity. Our findings demonstrate that, while early evaluations of the clinical use of LLMs are promising, LLM performance must be significantly improved before their deployment as decision support systems for clinical recommendations and other complex tasks.
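The trade-off described above (high sensitivity at the cost of specificity) can be made concrete with a short, self-contained sketch. The labels and predictions below are illustrative numbers only, not the study's data: they model an overly cautious system that recommends admission for most patients, so it misses few true admissions but flags many unnecessary ones.

```python
def confusion_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    return accuracy, sensitivity, specificity

# Hypothetical example: 1 = admit, 0 = discharge. An overly cautious model
# predicts "admit" for six of eight visits.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]
acc, sens, spec = confusion_metrics(y_true, y_pred)
# sensitivity is perfect (no missed admissions), but specificity and
# overall accuracy suffer from the unnecessary admissions.
```

On these made-up labels, sensitivity is 1.0 while specificity falls to 0.4 and accuracy to 0.625, mirroring the cautious-prediction pattern the abstract reports.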
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.