AI in Medical Education: A Comparative Analysis of GPT-4 and GPT-3.5 on Turkish Medical Specialization Exam Performance

Published: 2023-07-12

Author:

Kılıç Mustafa Eray

Abstract

Background/aim: Large language models (LLMs) such as GPT-4 and GPT-3.5 have shown considerable potential in the rapidly developing field of artificial intelligence (AI) in education. Their use in medical education, and in particular their effectiveness on examinations such as the Turkish Medical Specialization Exam (TUS), remains understudied. This study evaluates how well GPT-4 and GPT-3.5 answer TUS questions, offering insight into the practical uses and difficulties of AI in medical education.

Materials and methods: The study examined 1440 medical questions drawn from six TUS examinations. GPT-4 and GPT-3.5 were used to generate answers, and the data were analyzed with IBM SPSS 26.0. Correlation and regression analyses were used for further analysis.

Results: GPT-4 achieved a higher overall success rate (70.56%) than GPT-3.5 (40.17%) and physicians (38.14%). GPT-4 gave more correct answers and fewer incorrect answers than GPT-3.5, while the two models skipped about the same number of questions. Compared with physicians, GPT-4 gave more correct answers and achieved a better overall score. GPT-3.5 gave slightly more correct answers than physicians. Success rates differed markedly in all three pairwise comparisons: GPT-4 versus GPT-3.5, GPT-4 versus physicians, and GPT-3.5 versus physicians. Performance also varied by domain: physicians outperformed the AI models in anatomy, whereas the AI models performed best in pharmacology.

Conclusions: Both GPT-4 and GPT-3.5 achieved higher overall success rates than physicians on TUS questions. Despite these abilities, the models showed limitations in reasoning beyond given knowledge, particularly in anatomy. The study recommends integrating AI support into medical education to foster critical engagement with these technologies.

Publisher

Cold Spring Harbor Laboratory
