Analyzing Evaluation Methods for Large Language Models in the Medical Field: A Scoping Review

Published: 2024-01-24

Author:

Lee Junbok¹, Park Sungkyung², Shin Jaeyong³, Cho Belong⁴

Affiliation:

1. Yonsei University

2. Seoul National University of Science and Technology

3. Yonsei University College of Medicine

4. Seoul National University Hospital

Abstract

Background: Owing to the rapid growth in popularity of large language models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for LLM evaluation.

Objective: By reviewing studies on LLM evaluations in the medical field and analyzing the research methods used in these studies, this study aims to provide a reference for future researchers designing LLM studies.

Methods & Materials: We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM evaluation studies published between January 1, 2023, and September 30, 2023. We analyzed the method type, number of questions (queries), evaluators, repeat measurements, additional analysis methods, engineered prompts, and metrics other than accuracy.

Results: A total of 142 articles met the inclusion criteria. The LLM evaluation was primarily categorized as either providing test examinations (n=53, 37.3%) or being evaluated by a medical professional (n=80, 56.3%), with some hybrid cases (n=5, 3.5%) or a combination of the two (n=4, 2.8%). For test examinations, the most common category was 100 or fewer questions (n=18, 29.0%); 15 studies (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For medical assessment, most studies had 50 or fewer queries (n=54, 64.3%), the most common number of evaluators was two (n=43, 48.3%), and 14 studies (14.7%) used prompt engineering.

Conclusions: More research is required regarding the application of LLMs in healthcare. Although previous studies have evaluated performance, future studies will likely focus on improving performance. For these studies to be conducted systematically, a well-structured methodology must be designed.

Publisher

Research Square Platform LLC
