Rest in Peace Dr. Search Engines, and Long Live Dr. Chat GPT (Preprint)

Authors:

Şahin Bahadır, Doğan Kader, Genç Yunus, Şener Tarık Emre, Şekerci Çağrı Akın, Tanıdır Yılören, Yücel Selçuk, Tarcan Tufan

Abstract

BACKGROUND

In the modern era of healthcare, advances in artificial intelligence and natural language processing have enabled the development of chatbots that can assist in answering patient questions and providing healthcare information [1]. One such chatbot is ChatGPT, a large language model developed by OpenAI. ChatGPT can provide accurate and relevant information across various medical specialties, including urology. However, the reliability and accuracy of the information it provides, compared with that provided by human specialists, have yet to be thoroughly investigated for either version 3.5 or version 4. A previous study has shown that chatbots can be a valuable resource for patients seeking healthcare information, but concerns have been raised regarding the accuracy and reliability of the information these systems provide [2]. It is therefore essential to evaluate the performance of chatbots against that of human specialists to determine their potential role in healthcare.

OBJECTIVE

Our study compares the behavior, knowledge, and interpretive capacity of two ChatGPT versions against the views and approaches of an academic institution's expert healthcare providers, the European Association of Urology Guidelines, and the literature, using a three-stage survey.

METHODS

Our study comprised three main steps to evaluate the effectiveness of ChatGPT in the urologic field. In the first step, we generated 35 urology questions extracted from our institution's experts, each with at least 10 years of experience in their field (andrology, pediatric urology, functional urology, endourology, and urooncology). Study data were collected and managed using REDCap [3,4] electronic data capture tools licensed to the Urology Department of Marmara University, School of Medicine. All responses were evaluated for consistency with the 2023 European Association of Urology Guidelines and were double-checked by another academic expert. We then created an answer key, and each question was posed separately to ChatGPT version 3.5 and ChatGPT version 4. The answers of the two ChatGPT versions were compared using the chi-square (Fisher's exact) test. This comparison was used to assess the reliability of the ChatGPT versions with respect to medical care providers' clinical practice habits and decision-making, high-quality evidence, and responses grounded in evidence-based medicine (EBM). The second step aimed to assess the reliability of the ChatGPT versions on current debate topics, on which even academic urologists and mentors in the field disagree. As the debate questions have no absolute answer, the success rate and approach of ChatGPT were assessed against the experts' most common opinions. In this context, we prepared 15 "debate questions" for the expert urologists working at Marmara University, School of Medicine, and collected their answers via an online survey. The concordant answers among the healthcare providers were identified and compared with the answers of the ChatGPT versions; Fisher's exact test was used for this comparison. The last step of the study assessed the reliability of the ChatGPT versions' recommendations and directives for the questions patients most frequently ask on the internet. These ten questions were generated after interviews with healthcare professionals and patients admitted to our outpatient clinic. ChatGPT versions 3.5 and 4 were asked these questions separately, and their answers and medical directions were recorded. The answers were then scored from 0 to 10 by the academic urologists. Our three-stage survey was analyzed both quantitatively and qualitatively. Comparisons for all steps were performed with IBM SPSS Statistics for Windows, Version 27.0 (IBM Corp., Armonk, NY).
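
As an illustration of the version-to-version comparison described above, the following minimal sketch shows how a 2x2 correctness table could be tested with Fisher's exact test in Python; the counts, function names, and the two-sided test choice are placeholders for demonstration and do not reproduce the authors' SPSS analysis.

```python
# Minimal sketch of the step 1/step 2 comparison, assuming each step is
# summarized as a 2x2 table (ChatGPT version x correct/incorrect answers).
# Counts below are placeholders; the study itself used IBM SPSS, not this script.
from scipy.stats import fisher_exact


def compare_versions(correct_v4, correct_v35, n_questions):
    """Two-sided Fisher's exact test on a version-by-correctness 2x2 table."""
    table = [
        [correct_v4, n_questions - correct_v4],    # ChatGPT version 4
        [correct_v35, n_questions - correct_v35],  # ChatGPT version 3.5
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    return odds_ratio, p_value


# Placeholder example: 30 vs 20 correct answers out of 35 questions.
odds_ratio, p_value = compare_versions(30, 20, 35)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```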

RESULTS

In the first step of our study, the overall success rates of ChatGPT version 3.5 and version 4 differed at a statistically significant level in favor of version 4 (p=0.022; Table 2). Version 4 answered 25 of the 35 questions correctly, while version 3.5 answered only 19 correctly (71.4% vs 54.3%). The questions answered incorrectly by both versions generally measured clinical decision-making and experience. The second step was designed to examine how the ChatGPT versions handle debate situations in urology. For 13 of the 15 questions, 50% or more of the experts gave the same answer; we treated those answers as references, since no answer key existed owing to the providers' differing clinical approaches. ChatGPT version 3.5 matched the reference answer on 5 of those 13 questions (38.5%), while ChatGPT version 4 matched on 3 of 13 (23.1%); the p value was 0.510 (Table 3). The last step assessed ChatGPT's recommendations and guidance on ten questions commonly asked by patients, each scored from 0 to 10. The mean score was 8.6 for version 3.5 and 8.2 for version 4. Fisher's exact test did not demonstrate any statistically significant difference between the ChatGPT versions in informing patients (Table 1). According to the experts' ratings in step 3 (Table 1), the success rate appeared sufficient.
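
For transparency, the percentages quoted above follow directly from the reported counts; the short check below simply recomputes them and is not part of the authors' analysis.

```python
# Arithmetic check of the proportions reported above, using the counts stated
# in the abstract (25/35 and 19/35 for step 1; 5/13 and 3/13 for step 2).
step1_correct = {"ChatGPT 4": 25, "ChatGPT 3.5": 19}   # correct answers out of 35
step2_matches = {"ChatGPT 3.5": 5, "ChatGPT 4": 3}     # reference matches out of 13

for version, correct in step1_correct.items():
    print(f"Step 1, {version}: {correct}/35 = {correct / 35:.1%}")

for version, matched in step2_matches.items():
    print(f"Step 2, {version}: {matched}/13 = {matched / 13:.1%}")
```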

CONCLUSIONS

Our study showed that ChatGPT versions 3.5 and 4 were both successful in informing patients and providing medical directions, but not in directly providing treatment options or experience-based decision-making. The significant difference between the two versions on the first step's 35 questions was attributed to the improvement of ChatGPT's literature and data synthesis abilities. The lack of a significant difference between the two versions in the second and third steps was attributed to the absence of patient follow-up and real-world experience. The competency of ChatGPT on medical licensing exams has been demonstrated in a recent study [5]; our study likewise showed its competency in answering general knowledge questions. Information is now incredibly accessible and readily available in the age of artificial intelligence. A chatbot is a computer program that uses artificial intelligence (AI) and natural language processing (NLP) to understand questions and automate responses to them, simulating human conversation. These technologies rely on machine learning and deep learning, elements of AI with some nuanced differences, to develop an increasingly granular knowledge base of questions and responses based on user interactions, which improves their ability to predict user needs accurately and respond correctly over time. Considering physician burnout worldwide, it is essential to be able to reach good-quality information from both the literature and guidelines without spending much time [6]. Several studies have compared AI with physicians [7], but for the time being the better option is to use it to ease healthcare services provided by professionals and to support clinical practice, for example by creating discharge summaries, triaging patients, and preparing patient information forms [7,8]. Another goal is to give patients high-level guidance about medical systems and refer them to the healthcare they need, even when the medical system is overloaded. On the other end of the scale, there are concerns about its use in real-world situations, as well as ethical issues regarding patient data sharing and scanning [9]. It has previously been shown that ChatGPT provided wrong information and misattributed references [10]. In this study, we demonstrated that the responses gathered from both versions of ChatGPT were not always concordant with the responses of the experts in the field; this lack of field competency produced variable and inappropriate results and, based on the first two steps of our study, suggests that the development of successive versions is crucial for guideline-based information. It should also be noted that AI is not always competent enough to assess patients' medical and socioeconomic conditions, and it lacks information about the availability of therapeutic and diagnostic tools at the selected center. This lack of information may lead AI to give unrealistic or unsuitable responses, which in turn may cause a trust issue between patients and physicians [11]. Our study illustrates that it may be a logical approach to use the ChatGPT versions as an assistant for routine daily healthcare services, but ChatGPT should not be used to plan patients' treatment or to replace doctors. Patient information systems may be developed even further and may be especially beneficial for high-volume centers, providing high-quality patient care while saving manpower and time.
Therefore, as a free chatbot, ChatGPT version 3.5 is competent enough to inform non-healthcare-professional individuals seeking practical and fast medical directions. It may also be beneficial for healthcare professionals to use ChatGPT version 4 for well-known guideline-based information, since it achieved a high score in step 1 of this study. Several limitations of this study need to be acknowledged. First, the conditions were assessed in only one language (English). Real-world patient counselling requires a native-language engine to determine whether people can express themselves correctly and understand the directives given by chatbots. Another limitation is that the reference answers were based on only seven urology experts at a single center; expanding the pool of observers may change the results, as mentioned before [12]. This research encourages academic institutions to investigate AI further and to expand the sample size. We believe that AI will ease healthcare providers' workload and may prevent physician burnout while assisting them under professional supervision. However, our results clearly demonstrate that the accuracy and security of the information obtained with generative AI models should be strictly controlled, and decisions taken during patient treatment and follow-up should never be based solely on information obtained with AI.

Publisher

JMIR Publications Inc.
