Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations

Authors:

Massey, Patrick A.; Montgomery, Carver; Zhang, Andrew S.

Abstract

Introduction: Artificial intelligence (AI) programs can answer complex queries, including medical profession examination questions. The purpose of this study was to compare the performance of orthopaedic residents (ortho residents) against Chat Generative Pretrained Transformer (ChatGPT)-3.5 and GPT-4 on orthopaedic assessment examinations. A secondary objective was to perform a subgroup analysis comparing the performance of each group on questions that included image interpretation versus text-only questions.

Methods: The ResStudy orthopaedic examination question bank was used as the primary source of questions. One hundred eighty questions and answer choices from nine orthopaedic subspecialties were input directly into ChatGPT-3.5 and then GPT-4. Because ChatGPT did not have consistently available image interpretation, no images were provided to either AI model. Chatbot answers were recorded as correct or incorrect, and resident performance was recorded from user data provided by ResStudy.

Results: Overall, ChatGPT-3.5, GPT-4, and ortho residents scored 29.4%, 47.2%, and 74.2%, respectively. There was a difference among the three groups in testing success, with ortho residents scoring higher than both ChatGPT-3.5 and GPT-4 (P < 0.001 for each). GPT-4 scored higher than ChatGPT-3.5 (P = 0.002). A subgroup analysis was performed by dividing questions into question stems without images and question stems with images. ChatGPT-3.5 answered text-only questions correctly more often than questions with images (37.8% vs. 22.4%; OR = 2.1, P = 0.033), as did GPT-4 (61.0% vs. 35.7%; OR = 2.8, P < 0.001). Residents answered 72.6% of text-only questions and 75.5% of questions with images correctly, with no significant difference (P = 0.302).

Conclusion: Orthopaedic residents answered more questions accurately than ChatGPT-3.5 and GPT-4 on orthopaedic assessment examinations. GPT-4 is superior to ChatGPT-3.5 for answering orthopaedic resident assessment examination questions. Both ChatGPT-3.5 and GPT-4 performed better on text-only questions than on questions with images. It is unlikely that GPT-4 or ChatGPT-3.5 would pass the American Board of Orthopaedic Surgery written examination.
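To illustrate the kind of subgroup comparison reported above, the following is a minimal sketch of how an odds ratio and p-value could be computed for the text-only versus image-question comparison. The abstract does not state which statistical test was used or how many questions fell into each subgroup; the counts below are hypothetical, reconstructed from the reported percentages under an assumed even 90/90 split, and Fisher's exact test is used purely as an example.

# Hypothetical illustration only: subgroup counts are assumed, not taken
# from the study (the abstract does not report the text/image split).
from scipy.stats import fisher_exact

def compare_subgroups(correct_text, total_text, correct_image, total_image):
    """Build a 2x2 table (correct/incorrect by question type) and return
    the odds ratio and two-sided Fisher exact p-value."""
    table = [
        [correct_text, total_text - correct_text],      # text-only questions
        [correct_image, total_image - correct_image],   # questions with images
    ]
    return fisher_exact(table, alternative="two-sided")

# ChatGPT-3.5 example: 37.8% of an assumed 90 text-only questions is ~34 correct;
# 22.4% of an assumed 90 image questions is ~20 correct.
odds_ratio, p_value = compare_subgroups(34, 90, 20, 90)
print(f"OR = {odds_ratio:.2f}, P = {p_value:.3f}")  # OR comes out near the reported 2.1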

Publisher

Ovid Technologies (Wolters Kluwer Health)

Subject

Orthopedics and Sports Medicine, Surgery
