Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis-Reference-Cited by-同舟云学术

Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis

Published:2024-03-05 Issue:5 Volume:38 Page:2887-2893
ISSN:0930-2794
Container-title:Surgical Endoscopy
language:en
Short-container-title:Surg Endosc

Author:

Ghanem Yazid K.^ORCID,Rouhi Armaun D.,Al-Houssan Ammr,Saleh Zena,Moccia Matthew C.,Joshi Hansa,Dumon Kristoffel R.,Hong Young,Spitz Francis,Joshi Amit R.,Kwiatt Michael

Abstract

Abstract Introduction Generative artificial intelligence (AI) chatbots have recently been posited as potential sources of online medical information for patients making medical decisions. Existing online patient-oriented medical information has repeatedly been shown to be of variable quality and difficult readability. Therefore, we sought to evaluate the content and quality of AI-generated medical information on acute appendicitis. Methods A modified DISCERN assessment tool, comprising 16 distinct criteria each scored on a 5-point Likert scale (score range 16–80), was used to assess AI-generated content. Readability was determined using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores. Four popular chatbots, ChatGPT-3.5 and ChatGPT-4, Bard, and Claude-2, were prompted to generate medical information about appendicitis. Three investigators independently scored the generated texts blinded to the identity of the AI platforms. Results ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2 had overall mean (SD) quality scores of 60.7 (1.2), 62.0 (1.0), 62.3 (1.2), and 51.3 (2.3), respectively, on a scale of 16–80. Inter-rater reliability was 0.81, 0.75, 0.81, and 0.72, respectively, indicating substantial agreement. Claude-2 demonstrated a significantly lower mean quality score compared to ChatGPT-4 (p = 0.001), ChatGPT-3.5 (p = 0.005), and Bard (p = 0.001). Bard was the only AI platform that listed verifiable sources, while Claude-2 provided fabricated sources. All chatbots except for Claude-2 advised readers to consult a physician if experiencing symptoms. Regarding readability, FKGL and FRE scores of ChatGPT-3.5, ChatGPT-4, Bard, and Claude-2 were 14.6 and 23.8, 11.9 and 33.9, 8.6 and 52.8, 11.0 and 36.6, respectively, indicating difficulty readability at a college reading skill level. Conclusion AI-generated medical information on appendicitis scored favorably upon quality assessment, but most either fabricated sources or did not provide any altogether. Additionally, overall readability far exceeded recommended levels for the public. Generative AI platforms demonstrate measured potential for patient education and engagement about appendicitis.

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s00464-024-10739-5.pdf

Reference32 articles.

1. Duarte F (2024) Number of ChatGPT users. Exploding Topics. https://explodingtopics.com/blog/chatgpt-users

2. Shah NH, Entwistle DA, Pfeffer M (2023) Creation and adoption of large language models in medicine. JAMA 330(9):866. https://doi.org/10.1001/jama.2023.14217

3. Ron L, Kumar A, Chen J (2023) How chatbots and large language model artificial intelligence systems will reshape modern medicine. JAMA Intern Med 183(6):596. https://doi.org/10.1001/jamainternmed.2023.1835

4. Kirchner GJ, Kim RY, Weddle J, Bible JE (2023) Can artificial intelligence improve the readability of patient education materials? Clin Orthop Relat Res 481(11):2260–2267. https://doi.org/10.1097/corr.0000000000002668

5. Rouhi AD, Ghanem YK, Hoeltzel GD et al (2022) Online resources for patients considering hiatal hernia repair: a quality and readability analysis. J Gastrointest Surg 27(3):598–600. https://doi.org/10.1007/s11605-022-05460-4

Cited by 10 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware;Cureus;2024-08-28

2. Enhancing Health Literacy: Evaluating the Readability of Patient Handouts Revised by ChatGPT's Large Language Model;Otolaryngology–Head and Neck Surgery;2024-08-06

3. Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study;JMIR Medical Informatics;2024-07-31

4. Effectiveness of Various General large language models in Clinical Consensus and Case Analysis in Dental Implantology: A Comparative Study;2024-07-29

5. Comparative Analysis of ChatGPT and Google Gemini in the Creation of Patient Education Materials for Acute Appendicitis, Cholecystitis, and Hydrocele;Indian Journal of Surgery;2024-07-03