Abstract
Purpose
Herein, we assessed the accuracy of large language models (LLMs) in answering questions encountered in clinical radiology practice. We compared the performance of ChatGPT, GPT-4, and Google Bard using questions from the Japan Radiology Board Examination (JRBE).
Materials and methods
In total, 103 questions from the JRBE 2022 were used with permission from the Japan Radiological Society. These questions were categorized by pattern, required level of thinking, and topic. McNemar’s test was used to compare the proportion of correct responses between the LLMs. Fisher’s exact test was used to assess the performance of GPT-4 for each topic category.
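As an illustration of the two tests named above, the following is a minimal sketch, not the study's analysis code: it applies McNemar's test to a paired correct/incorrect table for two models graded on the same questions, and Fisher's exact test to correct/incorrect counts in two question categories. The correctness vectors and category counts are hypothetical placeholders; the sketch assumes the statsmodels mcnemar and scipy fisher_exact functions.

# Minimal sketch of the statistical comparisons (hypothetical data, not the study's).
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical 0/1 correctness for two models on the same 103 questions.
rng = np.random.default_rng(0)
model_a = rng.integers(0, 2, size=103)  # 1 = correct, 0 = incorrect
model_b = rng.integers(0, 2, size=103)

# Paired 2x2 table: rows = model A (correct, incorrect), columns = model B.
table = np.zeros((2, 2), dtype=int)
for a, b in zip(model_a, model_b):
    table[1 - a, 1 - b] += 1

# McNemar's test compares the discordant cells (A correct / B incorrect and vice versa).
result = mcnemar(table, exact=True)
print(result.statistic, result.pvalue)

# Fisher's exact test on correct/incorrect counts for two question categories
# (illustrative counts only).
category_table = [[14, 1],    # category 1: correct, incorrect
                  [29, 23]]   # category 2: correct, incorrect
odds_ratio, p_value = fisher_exact(category_table)
print(odds_ratio, p_value)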
Results
ChatGPT, GPT-4, and Google Bard correctly answered 40.8% (42 of 103), 65.0% (67 of 103), and 38.8% (40 of 103) of the questions, respectively. GPT-4 significantly outperformed ChatGPT by 24.2 percentage points (p < 0.001) and Google Bard by 26.2 percentage points (p < 0.001). In the categorical analysis by level of thinking, GPT-4 correctly answered 79.7% of the lower-order questions, significantly higher than ChatGPT or Google Bard (p < 0.001). The categorical analysis by question pattern revealed GPT-4’s superiority over ChatGPT (67.4% vs. 46.5%, p = 0.004) and Google Bard (39.5%, p < 0.001) on the single-answer questions. The categorical analysis by topic revealed that GPT-4 outperformed ChatGPT (40%, p = 0.013) and Google Bard (26.7%, p = 0.004) on the nuclear medicine questions. No significant differences were observed between the LLMs in the remaining categories. GPT-4 performed significantly better in nuclear medicine (93.3%) than in diagnostic radiology (55.8%; p < 0.001), and better on lower-order questions than on higher-order questions (79.7% vs. 45.5%, p < 0.001).
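As a small worked check (not taken from the paper), the arithmetic below reproduces the headline accuracies and the percentage-point gaps from the stated counts; the dictionary structure is purely illustrative.

# Reproduce the reported accuracies and pairwise gaps from the stated counts.
correct = {"ChatGPT": 42, "GPT-4": 67, "Google Bard": 40}
acc = {model: round(100 * n / 103, 1) for model, n in correct.items()}
print(acc)                                          # {'ChatGPT': 40.8, 'GPT-4': 65.0, 'Google Bard': 38.8}
print(round(acc["GPT-4"] - acc["ChatGPT"], 1))      # 24.2 percentage points
print(round(acc["GPT-4"] - acc["Google Bard"], 1))  # 26.2 percentage points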
Conclusion
ChatGPT Plus, based on GPT-4, scored 65% on Japanese-language questions from the JRBE, outperforming ChatGPT and Google Bard. This highlights the potential of LLMs for addressing advanced clinical questions in the field of radiology in Japan.
Publisher
Springer Science and Business Media LLC
Subject
Radiology, Nuclear Medicine and Imaging
Cited by
41 articles.