Performance of large language models on benign prostatic hyperplasia frequently asked questions

Published: 2024-04
Volume: 84
Issue: 9
Page: 807-813
ISSN: 0270-4137
Container-title: The Prostate
Language: en

Author:

Zhang YuNing¹², Dong Yijie¹², Mei Zihan¹², Hou Yiqing¹², Wei Minyan¹², Yeung Yat Hin¹², Xu Jiale¹², Hua Qing¹², Lai LiMei¹², Li Ning³, Xia ShuJun¹², Zhou Chun¹², Zhou JianQiao¹²

Affiliation:

1. Department of Ultrasound, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, China

2. College of Health Science and Technology, Shanghai Jiao Tong University School of Medicine, Shanghai, China

3. Department of Ultrasound, Yunnan Kungang Hospital, The Seventh Affiliated Hospital of Dali University, Anning, Yunnan, China

Abstract

Background: Benign prostatic hyperplasia (BPH) is a common condition, yet it is challenging for the average BPH patient to find credible and accurate information about BPH. Our goal is to evaluate and compare the accuracy and reproducibility of large language models (LLMs), including ChatGPT‐3.5, ChatGPT‐4, and the New Bing Chat, in responding to a BPH frequently asked questions (FAQs) questionnaire.

Methods: A total of 45 questions related to BPH were categorized into basic and professional knowledge. Three LLMs (ChatGPT‐3.5, ChatGPT‐4, and New Bing Chat) were used to generate responses to these questions. Responses were graded as comprehensive, correct but inadequate, mixed with incorrect/outdated data, or completely incorrect. Reproducibility was assessed by generating two responses for each question. All responses were reviewed and judged by experienced urologists.

Results: All three LLMs exhibited high accuracy in generating responses to questions, with accuracy rates ranging from 86.7% to 100%. However, there was no statistically significant difference in response accuracy among the three (p > 0.017 for all comparisons). Additionally, the accuracy of the LLMs' responses to the basic knowledge questions was roughly equivalent to that of the specialized knowledge questions, showing a difference of less than 3.5% (GPT‐3.5: 90% vs. 86.7%; GPT‐4: 96.7% vs. 95.6%; New Bing: 96.7% vs. 93.3%). Furthermore, all three LLMs demonstrated high reproducibility, with rates ranging from 93.3% to 97.8%.

Conclusions: ChatGPT‐3.5, ChatGPT‐4, and New Bing Chat offer accurate and reproducible responses to BPH‐related questions, establishing them as valuable resources for enhancing health literacy and supporting BPH patients in conjunction with healthcare professionals.
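The arithmetic behind two figures in the abstract can be checked with a short sketch. It uses only the percentages quoted above. The function names are hypothetical, and reading the 0.017 threshold as 0.05 split across three pairwise comparisons is an assumption, since the abstract does not name the correction that was applied.

```python
# Accuracy pairs (basic %, professional %) exactly as quoted in the abstract.
reported = {
    "GPT-3.5": (90.0, 86.7),
    "GPT-4": (96.7, 95.6),
    "New Bing": (96.7, 93.3),
}


def accuracy_gaps(pairs):
    """Return the basic-minus-professional gap, in percentage points, per model."""
    return {model: round(basic - prof, 1) for model, (basic, prof) in pairs.items()}


def corrected_threshold(alpha, n_comparisons):
    """Assumed Bonferroni-style correction: overall alpha divided by the number of tests."""
    return alpha / n_comparisons


gaps = accuracy_gaps(reported)

# Assumption: three models compared pairwise gives three comparisons at alpha = 0.05.
threshold = corrected_threshold(0.05, 3)

print(gaps)
print(round(threshold, 3))
```

Under these assumptions, every gap falls below the 3.5 percentage points stated in the abstract, and the corrected threshold rounds to the 0.017 quoted alongside the p values.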

Funder

National Natural Science Foundation of China

Publisher

Wiley

Link

https://onlinelibrary.wiley.com/doi/pdf/10.1002/pros.24699

References: 9 articles.

1. The global burden of lower urinary tract symptoms suggestive of benign prostatic hyperplasia: A systematic review and meta-analysis

2. What is a disease? What is the disease clinical benign prostatic hyperplasia (BPH)?

3. The Informed Patient

4. How Readable Is BPH Treatment Information on the Internet? Assessing Barriers to Literacy in Prostate Health

5. Language models are few‐shot learners; Brown T; Adv Neural Inf Process Syst, 2020

Cited by 3 articles.

1. Large language models and benign prostatic hyperplasia frequently asked questions; The Prostate; 2024-05-16

2. Responses to queries concerning “Performance of large language models on benign prostatic hyperplasia frequently asked questions”; The Prostate; 2024-05-16

3. Use of artificial intelligence chatbots in clinical management of immune-related adverse events; Journal for ImmunoTherapy of Cancer; 2024-05