Rest in Peace Dr. Search Engines, and Long Live Dr. Chat GPT (Preprint)
Author:
Şahin Bahadır, Doğan Kader, Genç Yunus, Şener Tarık Emre, Şekerci Çağrı Akın, Tanıdır Yılören, Yücel Selçuk, Tarcan Tufan
Abstract
BACKGROUND
In the modern era of healthcare, advancements in artificial intelligence and natural language
processing have allowed for the development of chatbots that can assist in answering patient
questions and providing healthcare information [1].
One such chatbot is ChatGPT, a large language model trained by OpenAI. ChatGPT can provide
accurate and relevant information in various medical specialties, including urology.
However, the reliability and accuracy of ChatGPT’s information compared with that provided by human specialists have yet to be thoroughly investigated for either version 3.5 or version 4.
Previous studies have shown that chatbots can be a valuable resource for patients seeking healthcare information; however, concerns have been raised regarding the accuracy and reliability of the information these systems provide [2]. As such, it is essential to evaluate the performance of chatbots compared with human specialists to determine their potential role in healthcare.
OBJECTIVE
Our study compares the behavior, knowledge, and interpretation capacity of two ChatGPT versions against the views and approaches of an academic institution’s expert healthcare providers, the European Association of Urology guidelines, and literature reviews, using a three-stage survey.
METHODS
Our study has three main steps to evaluate the effectiveness of ChatGPT in the urologic field.
We generated 35 urology questions contributed by our institution’s experts, each with at least 10 years of experience in fields such as andrology, pediatric urology, functional urology, endourology, and urooncology.
Study data were collected and managed using REDCap [3,4] electronic data capture tools licensed to the Urology Department of Marmara University School of Medicine.
All responses were evaluated for consistency with the 2023 European Association of Urology (EAU) guidelines and were double-checked by another academic expert. We then created an answer key, after which each question was posed to ChatGPT version 3.5 and ChatGPT version 4 separately.
The answers of the two ChatGPT versions were compared using the chi-square (Fisher exact) test.
This comparison was used to assess the reliability of the ChatGPT versions with respect to medical care providers’ clinical practice habits and decision-making, the use of high-quality evidence, and the provision of responses consistent with evidence-based medicine (EBM). The next step aimed to assess the reliability of the ChatGPT versions on topics that remain under debate, even among academic urologists and mentors of the field. As the debate questions do not have an absolute answer, the success rate and approach of ChatGPT were assessed against the experts’ most common opinions. In this context, we prepared a total of 15 "debate questions" for the expert urologists working at Marmara University School of Medicine and collected their answers via an online survey.
The answers on which the healthcare providers agreed were identified and compared with the ChatGPT versions’ answers. For this comparison, the Fisher exact test was used.
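To illustrate how such a consensus reference could be derived from the expert survey, the sketch below takes hypothetical expert answers for each debate question, finds the most common answer, and keeps it as the reference only when at least half of the experts agree (the 50% threshold reported in the Results). The data and function names are illustrative assumptions, not the authors’ actual pipeline.

```python
# Hypothetical sketch: deriving a consensus reference answer per debate question.
# A question gets a reference answer only when >= 50% of experts give the same answer.
from collections import Counter

# Hypothetical survey data: question id -> answers from 7 experts
expert_answers = {
    "Q1": ["A", "A", "A", "B", "A", "C", "A"],
    "Q2": ["B", "C", "A", "B", "C", "A", "D"],
}

def consensus_reference(answers, threshold=0.5):
    """Return the modal answer if it reaches the agreement threshold, else None."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= threshold else None

references = {q: consensus_reference(a) for q, a in expert_answers.items()}
print(references)  # e.g. {'Q1': 'A', 'Q2': None}
```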
The last part of the study was designed to assess the reliability of the ChatGPT versions’ recommendations and directives in response to the questions most frequently asked by patients on the internet. These ten questions were generated after interviews with healthcare professionals and with patients admitted to our outpatient clinic. ChatGPT versions 3.5 and 4 were asked these questions separately, and their answers and medical directions were recorded.
Afterwards, the answers were assessed by the academic urologists, who scored them from 0 to 10. Our three-stage survey was analyzed both quantitatively and qualitatively.
Chi-square comparisons for all steps were performed with IBM SPSS Statistics for Windows, version 27.0 (IBM Corp, Armonk, NY; released 2020).
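For readers who wish to reproduce this kind of comparison outside SPSS, the sketch below shows how an equivalent 2x2 analysis (chi-square and Fisher exact test) could be run in Python with SciPy. The counts are hypothetical placeholders rather than the study data, and the exact test configuration used in SPSS may differ.

```python
# Hypothetical sketch of the 2x2 comparison between two ChatGPT versions
# (correct vs incorrect answers). Counts below are placeholders, not study data.
from scipy.stats import chi2_contingency, fisher_exact

# rows: ChatGPT v3.5, ChatGPT v4; columns: correct, incorrect
table = [[12, 8],
         [17, 3]]

chi2, chi2_p, dof, expected = chi2_contingency(table)
odds_ratio, fisher_p = fisher_exact(table)

print(f"chi-square p = {chi2_p:.3f}, Fisher exact p = {fisher_p:.3f}")
```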
RESULTS
In the first step of our study, the overall success rate differed between ChatGPT version 3.5 and version 4 at a statistically significant level, in favor of version 4 (p=0.022) (Table 2). Version 4 provided correct answers to 25 of the 35 questions, while version 3.5 answered only 19 correctly (71.4% vs 54.3%). The questions answered incorrectly by both versions mainly measured clinical decision-making and experience.
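As a quick arithmetic check, the snippet below recomputes the step 1 success rates from the correct-answer counts reported above (25/35 for version 4 and 19/35 for version 3.5).

```python
# Recompute the step 1 success rates from the reported counts.
total_questions = 35
correct_v4, correct_v35 = 25, 19

print(f"version 4:   {correct_v4 / total_questions:.1%}")   # 71.4%
print(f"version 3.5: {correct_v35 / total_questions:.1%}")  # 54.3%
```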
The second step of our study was designed to observe how the ChatGPT versions respond to debated situations in urology. For 13 of the 15 questions, 50% or more of the experts gave the same answer; we therefore treated those answers as references, since no answer key existed owing to the providers’ differing clinical approaches.
ChatGPT version 3.5 gave the reference answer for 5 of those 13 questions (38.4%), while ChatGPT version 4 did so for 3 of 13 (23%); the p value was 0.510 (Table 3).
The last step assessed ChatGPT’s recommendations and guidance on ten questions commonly asked by patients, with each answer scored from 0 to 10. The mean score for version 3.5 was 8.6, while that for version 4 was 8.2 out of 10 points. The Fisher exact test did not demonstrate any statistically significant difference between the ChatGPT versions in informing patients (Table 1). According to the experts’ ratings in step 3 (Table 1), the success rate appeared sufficient.
CONCLUSIONS
Our study showed that ChatGPT versions 3.5 and 4 were both successful in informing patients and providing medical directions, but not in directly providing treatment options or experience-based decision-making. The significant difference between the two versions on the first step’s 35 questions is thought to reflect the improvement in ChatGPT’s literature and data synthesis abilities. The lack of a significant difference between the two versions in the second and third steps is thought to be due to their lack of patient follow-up and real-world experience.
The competency of ChatGPT on medical licensing exams has been demonstrated by a recent study [5]. Our study also showed the competency of ChatGPT in answering general knowledge questions. Information is now incredibly accessible and readily available in the age of artificial intelligence. A chatbot is a computer program that uses artificial intelligence (AI) and natural language processing (NLP) to understand questions and automate responses to them, simulating human conversation. These technologies rely on machine learning and deep learning, elements of AI with some nuanced differences, to develop an increasingly granular knowledge base of questions and responses based on user interactions. This improves their ability to predict user needs accurately and respond correctly over time.
Considering the burnout of physicians all over the world, it is essential to access good-quality information from both the literature and guidelines without spending excessive time [6].
Several studies have compared AI with real doctors [7], but for the time being the better option is to use AI to ease healthcare services provided by professionals and to support clinical practice, for example by creating discharge summaries, triaging patients, and preparing patient information forms [7,8]. Another application is to give patients high-level guidance about the medical system and to refer them to the healthcare they need, even when the medical system is overloaded.
At the other end of the scale, there are concerns about its use in real-world situations, as well as ethical issues regarding patient data sharing and scanning [9].
It has previously been shown that ChatGPT can provide wrong information and cite incorrect references [10]. In this study, we demonstrated that the responses gathered from both versions of ChatGPT were not always in concordance with the responses of experts in the field. This lack of field competency led to variable and inappropriate results. Based on the first two steps of our study, further development of these versions appears crucial for providing guideline-based information.
It should also be noted that AI is not always competent enough to assess patients’ medical and socioeconomic conditions. It also lacks information about the availability of therapeutic and diagnostic tools at a given center. This lack of information may cause AI to give unrealistic or unsuitable responses, which in turn may create trust issues between patients and physicians [11].
Our study illustrates that it may be logical to use the ChatGPT versions as assistants for routine daily healthcare services, but ChatGPT should not be used to plan patient treatment or to replace doctors. Such a patient information system could be developed further and could be especially beneficial for high-volume centers, providing high-quality patient care while saving manpower and time.
Therefore, as a free chatbot, ChatGPT version 3.5 is competent enough to inform non-healthcare professionals seeking practical and fast medical directions. It may also be beneficial for healthcare professionals to use ChatGPT version 4 for well-established, guideline-based information, since it achieved a high score in step 1 of this study.
Several limitations of this study need to be acknowledged. First, all conditions were assessed in only one language (English). The real-world experience of patient counselling requires a native-language engine to determine whether people can express themselves correctly and understand the directives given by chatbots.
Another limitation is that the reference answers were based on only seven urology experts at a single center; expanding to multiple observers may change the results, as mentioned before [12].
This research encourages academic institutions to investigate AI further and to expand the sample size. We believe that AI will ease healthcare providers’ workload and may prevent physician burnout while assisting them under professional supervision. However, our results clearly demonstrate that the accuracy and security of the information obtained with generative AI models should be strictly controlled, and decisions taken during patient treatment and follow-up should never be based solely on information obtained from AI.
Publisher
JMIR Publications Inc.