Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study

Published: 2024-02-03 Issue: 5 Volume: 29 Page: 407-414
ISSN: 1083-7159
Container-title: The Oncologist
Language: en

Author:

Iannantuono Giovanni Maria¹, Bracken-Clarke Dara², Karzai Fatima¹, Choo-Wosoba Hyoyoung³, Gulley James L², Floudas Charalampos S²

Affiliation:

1. Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA

2. Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA

3. Biostatistics and Data Management Section, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA

Abstract

Background: The capability of large language models (LLMs) to understand and generate human-readable text has prompted investigation of their potential as educational and management tools for patients with cancer and healthcare providers.

Materials and Methods: We conducted a cross-sectional study to evaluate the ability of ChatGPT-4, ChatGPT-3.5, and Google Bard to answer questions related to 4 domains of immuno-oncology (Mechanisms, Indications, Toxicities, and Prognosis). We generated 60 open-ended questions (15 for each domain). Questions were manually submitted to the LLMs, and responses were collected on June 30, 2023. Two reviewers evaluated the answers independently.

Results: ChatGPT-4 and ChatGPT-3.5 answered all questions, whereas Google Bard answered only 53.3% (P < .0001). The proportion of questions with reproducible answers was higher for ChatGPT-4 (95%) and ChatGPT-3.5 (88.3%) than for Google Bard (50%) (P < .0001). In terms of accuracy, the proportion of answers deemed fully correct was 75.4%, 58.5%, and 43.8% for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (P = .03). Furthermore, the proportion of responses deemed highly relevant was 71.9%, 77.4%, and 43.8% for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (P = .04). Regarding readability, the proportion of highly readable answers was higher for ChatGPT-4 (98.1%) and ChatGPT-3.5 (100%) than for Google Bard (87.5%) (P = .02).

Conclusion: ChatGPT-4 and ChatGPT-3.5 are potentially powerful tools in immuno-oncology, whereas Google Bard demonstrated relatively poorer performance. However, the risk of inaccuracy or incompleteness in the responses was evident in all 3 LLMs, highlighting the importance of expert-driven verification of the outputs returned by these technologies.

Funder

National Institutes of Health

National Cancer Institute

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/oncolo/article-pdf/29/5/407/57396318/oyae009.pdf

References: 37 articles.

1. Science in the age of large language models;Birhane,2023

2. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum;Ayers,2023

3. Health information on the internet: quality issues and international initiatives;Risk,2002

Cited by 2 articles.

1. Use of artificial intelligence chatbots in clinical management of immune-related adverse events;Journal for ImmunoTherapy of Cancer;2024-05

2. Accuracy of Different Generative Artificial Intelligence Models in Medical Question Answering: A Systematic Review and Network Meta-Analysis;2024