Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study

Published: 2024-01-12
Volume: 26
Page: e48996
ISSN: 1438-8871
Container-title: Journal of Medical Internet Research
Language: en
Short-container-title: J Med Internet Res

Author:

Guo Eddie, Gupta Mehul, Deng Jiawen, Park Ye-Jean, Paget Michael, Naugler Christopher

Abstract

Background The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves screening thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and for subsequent health care decisions. Traditional methods rely heavily on human reviewers and often require a significant investment of time and resources.

Objective This study aims to assess the performance of the OpenAI generative pretrained transformer (GPT) and GPT-4 application programming interfaces (APIs) in accurately and efficiently identifying relevant titles and abstracts from real-world clinical review data sets, and to compare their performance against ground truth labeling by 2 independent human reviewers.

Methods We introduce a novel workflow using the ChatGPT and GPT-4 APIs for screening titles and abstracts in clinical reviews. A Python script was created to make calls to the API with the screening criteria in natural language and a corpus of title and abstract data sets filtered by a minimum of 2 human reviewers. We compared the performance of our model against human-reviewed papers across 6 review papers, screening over 24,000 titles and abstracts.

Results Our results show an accuracy of 0.91, a macro F1-score of 0.60, a sensitivity for excluded papers of 0.91, and a sensitivity for included papers of 0.76. The interrater variability between 2 independent human screeners was κ=0.46, and the prevalence-adjusted and bias-adjusted κ between our proposed methods and the consensus-based human decisions was κ=0.96. On a randomly selected subset of papers, the GPT models demonstrated the ability to provide reasoning for their decisions and corrected their initial decisions upon being asked to explain their reasoning for incorrect classifications.

Conclusions Large language models have the potential to streamline the clinical review process, save valuable time and effort for researchers, and contribute to the overall quality of clinical reviews. By prioritizing the workflow and acting as an aid rather than a replacement for researchers and reviewers, models such as GPT-4 can enhance efficiency and lead to more accurate and reliable conclusions in medical research.
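
The workflow and metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' script: the prompt wording, the function names (`build_prompt`, `screen`, `evaluate`), the record layout, and the injected `ask` callable standing in for an API call are all assumptions made for illustration. The metric formulas are the standard ones for a two-class problem, with the prevalence-adjusted and bias-adjusted κ computed as 2 × observed agreement − 1.

```python
from typing import Callable, Dict, List, Sequence

LABELS = ("include", "exclude")


def build_prompt(criteria: str, title: str, abstract: str) -> str:
    """Assemble a natural-language screening prompt (hypothetical wording)."""
    return (
        "Screening criteria:\n" + criteria + "\n\n"
        "Title: " + title + "\n"
        "Abstract: " + abstract + "\n\n"
        "Answer with exactly one word: include or exclude."
    )


def screen(records: Sequence[Dict[str, str]], criteria: str,
           ask: Callable[[str], str]) -> List[str]:
    """Screen each record; `ask` is any function that sends a prompt to a model."""
    decisions = []
    for rec in records:
        reply = ask(build_prompt(criteria, rec["title"], rec["abstract"]))
        word = reply.strip().strip(".").lower()
        # Assumption: an unparseable reply defaults to "include" so that a
        # human reviewer still sees the paper.
        decisions.append(word if word in LABELS else "include")
    return decisions


def evaluate(pred: Sequence[str], truth: Sequence[str]) -> Dict[str, float]:
    """Accuracy, per-class sensitivity, macro F1, Cohen's kappa, and PABAK."""
    n = len(truth)
    p_o = sum(p == t for p, t in zip(pred, truth)) / n  # observed agreement
    metrics = {"accuracy": p_o, "pabak": 2 * p_o - 1}
    f1_scores = []
    p_e = 0.0  # agreement expected by chance
    for label in LABELS:
        tp = sum(p == label and t == label for p, t in zip(pred, truth))
        n_pred = sum(p == label for p in pred)
        n_true = sum(t == label for t in truth)
        recall = tp / n_true if n_true else 0.0
        precision = tp / n_pred if n_pred else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        metrics["sensitivity_" + label] = recall
        p_e += (n_pred / n) * (n_true / n)
    metrics["macro_f1"] = sum(f1_scores) / len(f1_scores)
    metrics["cohen_kappa"] = (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0
    return metrics
```

In practice `ask` would wrap a call to a chat-completion endpoint; passing it in as a parameter keeps the sketch runnable offline and makes the screening logic testable with a stub.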

Publisher

JMIR Publications Inc.

Subject

Health Informatics

Reference: 39 articles.

1. Scoping Reviews, Systematic Reviews, and Meta-Analysis: Applications in Veterinary Medicine

2. Knowledge Synthesis in Evidence-Based Medicine

3. Assessing the quality of studies in meta‐research: Review/guidelines on the most important quality assessment tools

4. Single-reviewer abstract screening missed 13 percent of relevant studies: a crowd-based, randomized controlled trial

5. What is heterogeneity and is it important?

Cited by 22 articles.

1. ChatGPT and neurosurgical education: A crossroads of innovation and opportunity;Journal of Clinical Neuroscience;2024-11

2. Implementation and evaluation of an additional GPT-4-based reviewer in PRISMA-based medical systematic literature reviews;International Journal of Medical Informatics;2024-09

3. Future of Evidence Synthesis: Automated, Living, and Interactive Systematic Reviews and Meta-analyses;Mayo Clinic Proceedings: Digital Health;2024-09

4. Human-Comparable Sensitivity of Large Language Models in Identifying Eligible Studies Through Title and Abstract Screening: 3-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews;Journal of Medical Internet Research;2024-08-16

5. Automated information extraction model enhancing traditional Chinese medicine RCT evidence extraction (Evi-BERT): algorithm development and validation;Frontiers in Artificial Intelligence;2024-08-15