Zero‐ and few‐shot prompting of generative large language models provides weak assessment of risk of bias in clinical trials

Published: 2024-08-23
ISSN: 1759-2879
Container-title: Research Synthesis Methods
Language: en
Short-container-title: Research Synthesis Methods

Author:

Šuster Simon¹, Baldwin Timothy², Verspoor Karin¹³

Affiliation:

1. School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia

2. Department of Natural Language Processing, MBZUAI, Abu Dhabi, United Arab Emirates

3. School of Computing Technologies, RMIT University, Melbourne, Victoria, Australia

Abstract

Existing systems for automating the assessment of risk‐of‐bias (RoB) in medical studies are supervised approaches that require substantial training data to work well. However, recent revisions to RoB guidelines have resulted in a scarcity of available training data. In this study, we investigate the effectiveness of generative large language models (LLMs) for assessing RoB. Their application requires little or no training data and, if successful, could serve as a valuable tool to assist human experts during the construction of systematic reviews. Following Cochrane's latest guidelines (RoB2) designed for human reviewers, we prepare instructions that are fed as input to LLMs, which then infer the risk associated with a trial publication. We distinguish between two modelling tasks: directly predicting RoB2 from text; and employing decomposition, in which a RoB2 decision is made after the LLM responds to a series of signalling questions. We curate new testing data sets and evaluate the performance of four general‐ and medical‐domain LLMs. The results fall short of expectations, with LLMs seldom surpassing trivial baselines. On the direct RoB2 prediction test set (n = 5993), LLMs perform akin to the baselines (F1: 0.1–0.2). In the decomposition task setup (n = 28,150), similar F1 scores are observed. Our additional comparative evaluation on RoB1 data also reveals results substantially below those of a supervised system. This testifies to the difficulty of solving this task based on (complex) instructions alone. Using LLMs as an assisting technology for assessing RoB2 thus currently seems beyond their reach.

Funder

Australian Research Council

Publisher

Wiley

Link

https://onlinelibrary.wiley.com/doi/pdf/10.1002/jrsm.1749
