Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study

Published: 2024-09-04 Volume: 12 Page: e59258
ISSN:2291-9694
Container-title:JMIR Medical Informatics
language:en
Short-container-title:JMIR Med Inform

Author:

Akyon Seyma Handan, Akyon Fatih Cagatay, Camyar Ahmet Sefa, Hızlı Fatih, Sari Talha, Hızlı Şamil

Abstract

Background: Reading medical papers is a challenging and time-consuming task for doctors, especially when the papers are long and complex. A tool that can help doctors efficiently process and understand medical papers is needed.

Objective: This study aims to critically assess and compare the capabilities of large language models (LLMs) in accurately and efficiently understanding medical research papers, using the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist, which provides a standardized framework for evaluating key elements of observational studies.

Methods: This methodological study evaluated how well new generative artificial intelligence tools understand medical papers. A novel benchmark pipeline processed 50 medical research papers from PubMed and compared the answers of 6 LLMs (GPT-3.5-Turbo, GPT-4-0613, GPT-4-1106, PaLM 2, Claude v1, and Gemini Pro) to the benchmark established by expert medical professors. Fifteen questions, derived from the STROBE checklist, assessed the LLMs' understanding of different sections of a research paper.

Results: LLMs exhibited varying performance, with GPT-3.5-Turbo achieving the highest percentage of correct answers (n=3916, 66.9%), followed by GPT-4-1106 (n=3837, 65.6%), PaLM 2 (n=3632, 62.1%), Claude v1 (n=2887, 58.3%), Gemini Pro (n=2878, 49.2%), and GPT-4-0613 (n=2580, 44.1%). Differences between LLMs were statistically significant (P<.001), with older models showing inconsistent performance compared to newer versions. Performance also varied by question across the different parts of a scholarly paper, with certain models, such as PaLM 2 and GPT-3.5, showing remarkable versatility and depth in understanding.

Conclusions: This study is the first to evaluate the performance of different LLMs in understanding medical papers using the retrieval-augmented generation method. The findings highlight the potential of LLMs to enhance medical research by improving efficiency and facilitating evidence-based decision-making. Further research is needed to address limitations such as the influence of question formats, potential biases, and the rapid evolution of LLMs.
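
The comparison described in the Methods and Results (model answers scored against an expert answer key, then reported as a count and percentage of correct answers) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `score_models`, the `(paper_id, question_id)` keying, and the exact-match scoring rule are all assumptions made for illustration, and the toy data are invented.

```python
from collections import defaultdict


def score_models(expert_key, model_answers):
    """Score each model against an expert answer key (hypothetical helper).

    expert_key:    {(paper_id, question_id): expert_answer}
    model_answers: {model_name: {(paper_id, question_id): model_answer}}
    Returns {model_name: (n_correct, n_scored, percent_correct)}.
    Exact-match comparison is an assumption of this sketch.
    """
    tallies = defaultdict(lambda: [0, 0])  # model -> [n_correct, n_scored]
    for model, answers in model_answers.items():
        for item, answer in answers.items():
            if item not in expert_key:
                continue  # skip items that have no expert benchmark answer
            tallies[model][1] += 1
            if answer == expert_key[item]:
                tallies[model][0] += 1
    # Report the percentage rounded to one decimal place.
    return {
        model: (correct, scored, round(100.0 * correct / scored, 1))
        for model, (correct, scored) in tallies.items()
        if scored > 0
    }


if __name__ == "__main__":
    # Invented toy data: two papers, hypothetical model names.
    expert_key = {("paper1", "q1"): "A", ("paper1", "q2"): "B", ("paper2", "q1"): "C"}
    model_answers = {
        "model_x": {("paper1", "q1"): "A", ("paper1", "q2"): "B", ("paper2", "q1"): "D"},
        "model_y": {("paper1", "q1"): "B"},
    }
    for model, (correct, scored, pct) in score_models(expert_key, model_answers).items():
        print(f"{model}: n={correct}/{scored} ({pct}%)")
```

Each model's denominator is the number of its own answers that have an expert benchmark answer, so a model that answers fewer items is scored on fewer items. Whether the study handled unanswered items this way is not stated in the abstract.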

Publisher

JMIR Publications Inc.
