Exploring the Pitfalls of Large Language Models: Inconsistency and Inaccuracy in Answering Pathology Board Examination-Style Questions

Published: 2023-08-08

Author:

Koga Shunsuke

Abstract

In the rapidly advancing field of artificial intelligence, large language models (LLMs) such as ChatGPT and Google Bard are making significant progress, with applications extending across various fields, including medicine. This study explores their potential utility and pitfalls by assessing the performance of these LLMs in answering 150 multiple-choice questions, encompassing 15 subspecialties in pathology, sourced from the PathologyOutlines.com Question Bank, a resource for pathology examination preparation. Overall, ChatGPT outperformed Google Bard, scoring 122 out of 150, while Google Bard achieved a score of 70. Additionally, we explored the consistency of these LLMs by applying a test-retest approach over a two-week interval. ChatGPT showed a consistency rate of 85%, while Google Bard exhibited a consistency rate of 61%. In-depth analysis of incorrect responses identified potential factual inaccuracies and interpretive errors. While LLMs have the potential to enhance medical education and assist clinical decision-making, their current limitations underscore the need for continued development and the critical role of human expertise in the application of such models.
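
The two metrics in the abstract can be illustrated with a minimal Python sketch. The abstract does not define how consistency was calculated, so the version below assumes it is the share of questions that received the same answer in both test sessions. The function names and the four-question answer lists are hypothetical and are not taken from the study; only the scores 122, 70, and 150 come from the abstract.

```python
def accuracy(correct: int, total: int) -> float:
    """Return the percentage of questions answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def consistency_rate(first_run: list[str], second_run: list[str]) -> float:
    """Return the percentage of questions answered identically in both runs.

    Assumption: consistency is the share of matching answers between the
    first and second session, regardless of whether the answers are correct.
    """
    if len(first_run) != len(second_run) or not first_run:
        raise ValueError("runs must be non-empty and of equal length")
    same = sum(a == b for a, b in zip(first_run, second_run))
    return 100.0 * same / len(first_run)


# Scores reported in the abstract, out of 150 questions.
chatgpt_accuracy = accuracy(122, 150)
bard_accuracy = accuracy(70, 150)

# Hypothetical answers to four questions in two sessions (illustration only).
session_1 = ["A", "B", "C", "D"]
session_2 = ["A", "B", "D", "D"]
example_consistency = consistency_rate(session_1, session_2)

print(f"ChatGPT accuracy: {chatgpt_accuracy:.1f}%")
print(f"Google Bard accuracy: {bard_accuracy:.1f}%")
print(f"Example consistency: {example_consistency:.1f}%")
```

Under this reading, a model can be consistent without being accurate, since the same wrong answer given twice still counts as a match.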

Publisher

Cold Spring Harbor Laboratory


Cited by 1 article.

1. Effectiveness of ChatGPT in Coding: A Comparative Analysis of Popular Large Language Models. Digital. 2024-01-08.