Evaluating large language models on a highly-specialized topic, radiation oncology physics

Published: 2023-07-17 Volume: 13
ISSN: 2234-943X
Container-title: Frontiers in Oncology
Short-container-title: Front. Oncol.

Author:

Holmes Jason, Liu Zhengliang, Zhang Lian, Ding Yuzhen, Sio Terence T., McGee Lisa A., Ashman Jonathan B., Li Xiang, Liu Tianming, Shen Jiajian, Liu Wei

Abstract

Purpose: We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, LSAT, and GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow for accurately assessing the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities in addition to being a valuable benchmark of LLMs.

Methods: We developed an exam consisting of 100 radiation oncology physics questions based on our expertise. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. The performance of ChatGPT (GPT-4) was further explored by prompting it to explain first, then answer. The deductive reasoning capability of ChatGPT (GPT-4) was evaluated using a novel approach (substituting the correct answer with “None of the above choices is the correct answer.”). A majority vote analysis was used to approximate how well each group could score when working together.

Results: ChatGPT (GPT-4) outperformed all other LLMs and medical physicists, on average, with improved accuracy when prompted to explain before answering. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices across a number of trials, whether correct or incorrect, a characteristic that was not observed in the human test groups or Bard (LaMDA). In evaluating deductive reasoning ability, ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scoring based on a majority vote across trials. In contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote.

Conclusion: This study suggests a great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.
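
The majority vote analysis mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `majority_vote_score`, the data layout (one list of answer choices per trial or test-taker), the tie-breaking rule (first choice encountered wins), and the toy answers are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote_score(trials, answer_key):
    """Fraction of questions where the most common answer across trials is correct.

    trials: list of answer lists, one per trial or test-taker (hypothetical layout).
    answer_key: list of correct choices, one per question.
    Ties go to the choice seen first, which is an assumption, not the paper's rule.
    """
    correct = 0
    for q, key in enumerate(answer_key):
        votes = Counter(trial[q] for trial in trials)
        top_choice, _ = votes.most_common(1)[0]
        if top_choice == key:
            correct += 1
    return correct / len(answer_key)


# Toy data, invented for illustration only.
answer_key = ["A", "B", "C", "D"]
varied_group = [
    ["A", "B", "C", "A"],
    ["A", "C", "C", "B"],
    ["B", "B", "D", "D"],
]
consistent_trials = [["A", "B", "D", "D"]] * 3

group_score = majority_vote_score(varied_group, answer_key)
repeat_score = majority_vote_score(consistent_trials, answer_key)
single_score = majority_vote_score(consistent_trials[:1], answer_key)
```

In this toy data the varied group's majority vote scores higher than the average of its individual members, because different members miss different questions. Identical repeated trials score exactly the same as a single trial, which matches the abstract's observation that highly consistent answers leave no room for improvement through voting.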

Publisher

Frontiers Media SA

Subject

Cancer Research, Oncology

Cited by 48 articles.

1. Benchmarking a Foundation Large Language Model on its Ability to Relabel Structure Names in Accordance With the American Association of Physicists in Medicine Task Group-263 Report;Practical Radiation Oncology;2024-09

2. How do large language models answer breast cancer quiz questions? A comparative study of GPT-3.5, GPT-4 and Google Gemini;La radiologia medica;2024-08-13

3. Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study;JMIR Medical Education;2024-08-13

4. Large language models in healthcare: from a systematic review on medical examinations to a comparative analysis on fundamentals of robotic surgery online test;Artificial Intelligence Review;2024-08-06

5. A joint ESTRO and AAPM guideline for development, clinical validation and reporting of artificial intelligence models in radiation therapy;Radiotherapy and Oncology;2024-08