Chatbots are Not Yet Safe for Emergency Care Patient Use: Deficiencies of AI Responses to Clinical Questions (Preprint)

Authors:

Jonathan Yi-Shin Yau, Soheil Saadat, Edmund Hsu, Linda Suk-Ling Murphy, Jennifer S. Roh, Jeffrey Suchard, Antonio Tapia, Warren Wiechmann, Mark I. Langdorf

Abstract

BACKGROUND

Recent surveys indicate that 58% of consumers actively use generative AI for health-related inquiries. Despite widespread adoption and potential to improve healthcare access, scant research examines the performance of AI chatbot responses regarding emergency care advice.

OBJECTIVE

We assessed the quality of AI chatbot responses to common emergency care questions. We sought to determine qualitative differences in responses from four free-access AI chatbots, for ten different serious and benign emergency conditions.

METHODS

We created 10 emergency care questions and fed them into the free-access versions of ChatGPT 3.5, Google Bard, Bing AI Chat, and Claude AI on November 26, 2023. The correct, complete response to each question was compiled by an EM resident physician from reputable, scholarly emergency medicine references. Five board-certified emergency medicine (EM) faculty then graded each chatbot response across eight domains: percentage accuracy, presence of dangerous information, factual accuracy, clarity, completeness, understandability, source reliability, and source relevancy. For readability of the chatbot responses, we used the Flesch-Kincaid Grade Level (FKGL) of each response, from the readability statistics embedded in Microsoft Word. Differences between chatbots were determined by chi-square test.
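The study used Microsoft Word's built-in readability statistics to obtain FKGL scores. As an illustration only (not the authors' implementation), the standard Flesch-Kincaid Grade Level formula can be sketched in Python with a naive vowel-group syllable heuristic; Word's internal syllable counting may differ, so scores will not match exactly:

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count contiguous vowel groups; drop a trailing
    # silent "e". Adequate for a rough FKGL estimate, not exact.
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def fkgl(text: str) -> float:
    # Standard Flesch-Kincaid Grade Level formula:
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

For example, a short all-monosyllable sentence scores near or below grade 0, while the grade-8-to-11 scores reported below reflect longer sentences and polysyllabic medical vocabulary.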

RESULTS

Each of the four chatbots’ responses to the 10 clinical questions was scored across eight domains by five EM faculty, yielding 400 assessments per chatbot. Together, the four chatbots performed best in clarity and understandability (both 85%), intermediately in accuracy and completeness (both 50%), and poorly (10%) in source relevance and reliability (sources were mostly unreported). Chatbot responses contained dangerous information in 5-35% of cases, with no statistical difference between chatbots on this metric. ChatGPT, Google Bard, and Claude AI performed similarly across 7 of 8 domains; only Bing AI performed better, with more identified and relevant sources (40% vs. 0-10% for the others). Flesch-Kincaid grade level was 7.7-8.9 for all chatbots except ChatGPT at 10.8, all too advanced for average emergency patients. Responses included both dangerous advice (e.g., start CPR with no pulse check) and generally inappropriate advice (e.g., loosen the collar to improve breathing without evidence of airway compromise).

CONCLUSIONS

AI chatbots, though ubiquitous, show significant deficiencies in emergency medicine patient advice, despite relatively consistent performance across models. Information on when to seek urgent or emergent care is frequently incomplete and inaccurate, and patients may be unaware of such misinformation. Sources are generally not provided. Patients who use AI chatbots to guide healthcare decisions assume potential risk. AI chatbots for health may exacerbate disparities in social determinants of health and warrant further research, refinement, and regulation. We strongly recommend proper medical consultation to prevent potential adverse outcomes.

Publisher

JMIR Publications Inc.
