Benchmarking ChatGPT-4 on a radiation oncology in-training exam and Red Journal Gray Zone cases: potentials and challenges for AI-assisted medical education and decision making in radiation oncology

Published: 2023-09-14
Volume: 13
ISSN: 2234-943X
Container-title: Frontiers in Oncology
Short-container-title: Front. Oncol.

Author:

Huang Yixing, Gomaa Ahmed, Semrau Sabine, Haderlein Marlen, Lettmaier Sebastian, Weissmann Thomas, Grigo Johanna, Tkhayat Hassen Ben, Frey Benjamin, Gaipl Udo, Distel Luitpold, Maier Andreas, Fietkau Rainer, Bert Christoph, Putz Florian

Abstract

Purpose: The potential of large language models in medicine for education and decision-making purposes has been demonstrated as they have achieved decent scores on medical exams such as the United States Medical Licensing Exam (USMLE) and the MedQA exam. This work aims to evaluate the performance of ChatGPT-4 in the specialized field of radiation oncology.

Methods: The 38th American College of Radiology (ACR) radiation oncology in-training (TXIT) exam and the 2022 Red Journal Gray Zone cases are used to benchmark the performance of ChatGPT-4. The TXIT exam contains 300 questions covering various topics of radiation oncology. The 2022 Gray Zone collection contains 15 complex clinical cases.

Results: For the TXIT exam, ChatGPT-3.5 and ChatGPT-4 have achieved the scores of 62.05% and 78.77%, respectively, highlighting the advantage of the latest ChatGPT-4 model. Based on the TXIT exam, ChatGPT-4’s strong and weak areas in radiation oncology are identified to some extent. Specifically, ChatGPT-4 demonstrates better knowledge of statistics, CNS & eye, pediatrics, biology, and physics than knowledge of bone & soft tissue and gynecology, as per the ACR knowledge domain. Regarding clinical care paths, ChatGPT-4 performs better in diagnosis, prognosis, and toxicity than brachytherapy and dosimetry. It lacks proficiency in in-depth details of clinical trials. For the Gray Zone cases, ChatGPT-4 is able to suggest a personalized treatment approach to each case with high correctness and comprehensiveness. Importantly, it provides novel treatment aspects for many cases, which are not suggested by any human experts.

Conclusion: Both evaluations demonstrate the potential of ChatGPT-4 in medical education for the general public and cancer patients, as well as the potential to aid clinical decision-making, while acknowledging its limitations in certain domains. Owing to the risk of hallucinations, it is essential to verify the content generated by models such as ChatGPT for accuracy.

Publisher

Frontiers Media SA

Subject

Cancer Research, Oncology

References: 53 articles.

1. Attention is all you need;Vaswani;Adv Neural Inf Process Syst,2017

2. Language models are few-shot learners;Brown;Adv Neural Inf Process Syst,2020

3. Chain of thought prompting elicits reasoning in large language models;Wei;NeurIPS,2022

4. ChatGPT, bard, and large language models for biomedical research: opportunities and pitfalls;Thapa;Ann Biomed Eng,2023

5. LLaMA: Open and efficient foundation language models;Touvron;arXiv preprint arXiv:2302.13971,2023

Cited by 27 articles.

1. Beyond the Scalpel: Assessing ChatGPT's potential as an auxiliary intelligent virtual assistant in oral surgery;Computational and Structural Biotechnology Journal;2024-12

2. Estudio comparativo de la capacidad de aprendizaje de ChatGPT en la resolución de preguntas de especialización médica;Open Respiratory Archives;2024-10

3. Assessing the role of advanced artificial intelligence as a tool in multidisciplinary tumor board decision-making for recurrent/metastatic head and neck cancer cases – the first study on ChatGPT 4o and a comparison to ChatGPT 4.0;Frontiers in Oncology;2024-09-05

4. Benchmarking a Foundation Large Language Model on its Ability to Relabel Structure Names in Accordance With the American Association of Physicists in Medicine Task Group-263 Report;Practical Radiation Oncology;2024-09

5. Assessing the use of the novel tool Claude 3 in comparison to ChatGPT 4.0 as an artificial intelligence tool in the diagnosis and therapy of primary head and neck cancer cases;European Archives of Oto-Rhino-Laryngology;2024-08-07