Accuracy and Reliability of Chatbot Responses to Physician Questions

Authors:

Goodman Rachel S.1, Patrinely J. Randall2, Stone Cosby A.3, Zimmerman Eli4, Donald Rebecca R.5, Chang Sam S.6, Berkowitz Sean T.7, Finn Avni P.7, Jahangir Eiman8, Scoville Elizabeth A.9, Reese Tyler S.10, Friedman Debra L.11, Bastarache Julie A.3, van der Heijden Yuri F.12, Wright Jordan J.13, Ye Fei14, Carter Nicholas15, Alexander Matthew R.16, Choe Jennifer H.17, Chastain Cody A.12, Zic John A.2, Horst Sara N.9, Turker Isik18, Agarwal Rajiv17, Osmundson Evan19, Idrees Kamran20, Kiernan Colleen M.20, Padmanabhan Chandrasekhar20, Bailey Christina E.20, Schlegel Cameron E.20, Chambless Lola B.21, Gibson Michael K.17, Osterman Travis J.22, Wheless Lee E.2, Johnson Douglas B.17

Affiliations:

1. Vanderbilt University School of Medicine, Nashville, Tennessee

2. Department of Dermatology, Vanderbilt University Medical Center, Nashville, Tennessee

3. Department of Allergy, Pulmonology, and Critical Care, Vanderbilt University Medical Center, Nashville, Tennessee

4. Department of Neurology, Vanderbilt University Medical Center, Nashville, Tennessee

5. Department of Anesthesiology, Vanderbilt University Medical Center, Nashville, Tennessee

6. Department of Urology, Vanderbilt University Medical Center, Nashville, Tennessee

7. Vanderbilt Eye Institute, Department of Ophthalmology, Vanderbilt University Medical Center, Nashville, Tennessee

8. Department of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, Tennessee

9. Department of Gastroenterology, Hepatology, and Nutrition, Vanderbilt University Medical Center, Nashville, Tennessee

10. Department of Rheumatology and Immunology, Vanderbilt University Medical Center, Nashville, Tennessee

11. Department of Pediatric Hematology/Oncology, Vanderbilt University Medical Center, Nashville, Tennessee

12. Department of Infectious Disease, Vanderbilt University Medical Center, Nashville, Tennessee

13. Department of Diabetes, Endocrinology, and Metabolism, Vanderbilt University Medical Center, Nashville, Tennessee

14. Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee

15. Division of Trauma and Surgical Critical Care, University of Miami Miller School of Medicine, Miami, Florida

16. Department of Cardiovascular Medicine and Clinical Pharmacology, Vanderbilt University Medical Center, Nashville, Tennessee

17. Department of Hematology/Oncology, Vanderbilt University Medical Center, Nashville, Tennessee

18. Department of Cardiology, Washington University School of Medicine in St Louis, St Louis, Missouri

19. Department of Radiation Oncology, Vanderbilt University Medical Center, Nashville, Tennessee

20. Department of Surgical Oncology & Endocrine Surgery, Vanderbilt University Medical Center, Nashville, Tennessee

21. Department of Neurological Surgery, Vanderbilt University Medical Center, Nashville, Tennessee

22. Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee

Abstract

Importance: Natural language processing tools, such as ChatGPT (generative pretrained transformer, hereafter referred to as chatbot), have the potential to radically enhance the accessibility of medical information for health professionals and patients. Assessing the safety and efficacy of these tools in answering physician-generated questions is critical to determining their suitability in clinical settings, facilitating complex decision-making, and optimizing health care efficiency.

Objective: To assess the accuracy and comprehensiveness of chatbot-generated responses to physician-developed medical queries, highlighting the reliability and limitations of artificial intelligence–generated medical information.

Design, Setting, and Participants: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard, with either binary (yes or no) or descriptive answers. The physicians then graded the chatbot-generated answers to these questions for accuracy (6-point Likert scale, with 1 being completely incorrect and 6 being completely correct) and completeness (3-point Likert scale, with 1 being incomplete and 3 being complete plus additional context). Scores were summarized with descriptive statistics and compared using the Mann-Whitney U test or the Kruskal-Wallis test. The study (including data analysis) was conducted from January to May 2023.

Main Outcomes and Measures: Accuracy, completeness, and consistency over time and between 2 different versions (GPT-3.5 and GPT-4) of chatbot-generated medical responses.

Results: Across all questions (n = 284) generated by 33 physicians (31 faculty members and 2 recent graduates from residency or fellowship programs) across 17 specialties, the median accuracy score was 5.5 (IQR, 4.0-6.0) (between almost completely and completely correct), with a mean (SD) score of 4.8 (1.6) (between mostly and almost completely correct). The median completeness score was 3.0 (IQR, 2.0-3.0) (complete and comprehensive), with a mean (SD) score of 2.5 (0.7). For questions rated easy, medium, and hard, the median accuracy scores were 6.0 (IQR, 5.0-6.0), 5.5 (IQR, 5.0-6.0), and 5.0 (IQR, 4.0-6.0), respectively (mean [SD] scores, 5.0 [1.5], 4.7 [1.7], and 4.6 [1.6], respectively; P = .05). Accuracy scores for binary and descriptive questions were similar (median score, 6.0 [IQR, 4.0-6.0] vs 5.0 [IQR, 3.4-6.0]; mean [SD] score, 4.9 [1.6] vs 4.7 [1.6]; P = .07). Of 36 questions with scores of 1.0 to 2.0, 34 were requeried or regraded 8 to 17 days later with substantial improvement (median score, 2.0 [IQR, 1.0-3.0] vs 4.0 [IQR, 2.0-5.3]; P < .01). A subset of questions, regardless of initial scores with version 3.5, was regenerated and rescored using version 4, with improvement (mean [SD] accuracy score, 5.2 [1.5] vs 5.7 [0.8]; median score, 6.0 [IQR, 5.0-6.0] for original and 6.0 [IQR, 6.0-6.0] for rescored; P = .002).

Conclusions and Relevance: In this cross-sectional study, the chatbot generated largely accurate information in response to diverse medical queries as judged by academic physician specialists, with improvement over time, although it had important limitations. Further research and model development are needed to correct inaccuracies and for validation.
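The comparisons reported in the abstract (accuracy across difficulty levels via the Kruskal-Wallis test, binary vs descriptive questions via the Mann-Whitney U test, and median/IQR and mean/SD summaries) follow a standard nonparametric workflow for Likert-scale ratings. The sketch below is not the authors' analysis code; it uses hypothetical placeholder scores and assumes the NumPy and SciPy libraries, purely to illustrate how such summaries and tests are typically computed.

```python
# Minimal illustrative sketch (not the study's actual pipeline):
# summarize and compare hypothetical 6-point accuracy ratings.
import numpy as np
from scipy import stats

# Hypothetical placeholder ratings (1 = completely incorrect, 6 = completely correct)
easy = np.array([6, 6, 5, 6, 4, 6, 5, 6])
medium = np.array([5, 6, 5, 4, 6, 5, 3, 6])
hard = np.array([5, 4, 6, 3, 5, 4, 6, 2])

binary = np.array([6, 6, 5, 4, 6, 5])
descriptive = np.array([5, 4, 6, 3, 5, 6])

def summarize(name, scores):
    """Print median (IQR) and mean (SD), the summaries used in the abstract."""
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    print(f"{name}: median {med:.1f} (IQR, {q1:.1f}-{q3:.1f}); "
          f"mean {scores.mean():.1f} (SD {scores.std(ddof=1):.1f})")

for name, scores in [("easy", easy), ("medium", medium), ("hard", hard)]:
    summarize(name, scores)

# Kruskal-Wallis test: do ratings differ across the three difficulty levels?
h_stat, p_difficulty = stats.kruskal(easy, medium, hard)
print(f"Kruskal-Wallis across difficulty: H = {h_stat:.2f}, P = {p_difficulty:.3f}")

# Mann-Whitney U test: do ratings differ between binary and descriptive questions?
u_stat, p_format = stats.mannwhitneyu(binary, descriptive, alternative="two-sided")
print(f"Mann-Whitney U, binary vs descriptive: U = {u_stat:.1f}, P = {p_format:.3f}")
```

Nonparametric tests are the natural choice here because ordinal Likert ratings do not satisfy the distributional assumptions of t tests or ANOVA.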

Publisher

American Medical Association (AMA)

Subject

General Medicine
