Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis

Published: 2024-05-22 Volume: 26 Page: e53164
ISSN: 1438-8871
Container-title: Journal of Medical Internet Research
Language: en
Short-container-title: J Med Internet Res

Author:

Chelli Mikaël, Descamps Jules, Lavoué Vincent, Trojani Christophe, Azar Michel, Deckert Marcel, Raynier Jean-Luc, Clowez Gilles, Boileau Pascal, Ruetsch-Chelli Caroline

Abstract

Background: Large language models (LLMs) have raised both interest and concern in the academic community. They offer the potential to automate literature search and synthesis for systematic reviews, but their reliability remains in question because their tendency to generate unsupported (hallucinated) content persists.

Objective: This study aimed to assess the performance of LLMs such as ChatGPT and Bard (subsequently rebranded Gemini) in producing references in the context of scientific writing.

Methods: The performance of ChatGPT and Bard in replicating the results of human-conducted systematic reviews was assessed. Using systematic reviews pertaining to shoulder rotator cuff pathology, these LLMs were given the same inclusion criteria as the original reviews, and their output was compared with the references of those reviews, which served as gold standards. The study used 3 key performance metrics (recall, precision, and F1-score) alongside the hallucination rate. Papers were considered "hallucinated" if any 2 of the following pieces of information were wrong: title, first author, or year of publication.

Results: In total, 11 systematic reviews across 4 fields yielded 33 prompts to LLMs (3 LLMs×11 reviews), with 471 references analyzed. Precision rates for GPT-3.5, GPT-4, and Bard were 9.4% (13/139), 13.4% (16/119), and 0% (0/104), respectively (P<.001). Recall rates were 11.9% (13/109) for GPT-3.5 and 13.7% (15/109) for GPT-4, with Bard failing to retrieve any relevant papers (P<.001). Hallucination rates stood at 39.6% (55/139) for GPT-3.5, 28.6% (34/119) for GPT-4, and 91.4% (95/104) for Bard (P<.001). Further analysis of nonhallucinated papers retrieved by the GPT models revealed significant differences in how well they identified various criteria, such as randomized studies, participant criteria, and intervention criteria. The study also noted geographical and open-access biases in the papers retrieved by the LLMs.

Conclusions: Given their current performance, LLMs should not be deployed as the primary or exclusive tool for conducting systematic reviews. Any references generated by such models warrant thorough validation by researchers. The high occurrence of hallucinations in LLMs highlights the need to refine their training and functionality before they can be used with confidence for rigorous academic purposes.
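
The metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the `Reference` class, both function names, the case-insensitive string comparison, and the choice to treat a reference with no matching real record as hallucinated are all assumptions made for illustration. The counts in the usage lines are the GPT-3.5 figures quoted in the abstract; the F1 value printed is derived arithmetically from those counts and is not a figure reported in the abstract.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Reference:
    """Hypothetical record holding the three fields checked in the abstract."""
    title: str
    first_author: str
    year: int


def is_hallucinated(generated: Reference, real: Optional[Reference]) -> bool:
    """Apply the abstract's rule: hallucinated if at least 2 of
    title, first author, and year are wrong.

    Assumption: `real` is the closest verified record found by a human
    checker; if none exists (None), the reference counts as hallucinated.
    Assumption: text fields are compared case-insensitively after trimming.
    """
    if real is None:
        return True
    wrong = sum([
        generated.title.strip().lower() != real.title.strip().lower(),
        generated.first_author.strip().lower() != real.first_author.strip().lower(),
        generated.year != real.year,
    ])
    return wrong >= 2


def precision_recall_f1(n_correct: int, n_generated: int, n_gold: int) -> Tuple[float, float, float]:
    """Precision = correct / generated, recall = correct / gold standard,
    F1 = harmonic mean of the two (0.0 when undefined)."""
    precision = n_correct / n_generated if n_generated else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # GPT-3.5 counts from the abstract: 13 correct, 139 generated, 109 in the gold standard.
    p, r, f1 = precision_recall_f1(13, 139, 109)
    print(f"precision={p:.1%} recall={r:.1%} f1={f1:.1%}")
    # Hallucination rate from the abstract's counts: 55 of 139 generated references.
    print(f"hallucination rate={55 / 139:.1%}")
```

In practice, exact string comparison of titles is likely too strict, so a real check would probably need fuzzy matching or manual verification.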

Publisher

JMIR Publications Inc.

Cited by 3 articles.

1. Comparison of ChatGPT and Gemini as sources of references in otorhinolaryngology; 2024-08-13

2. Sligpt: A Large Language Model-Based Approach for Data Dependency Analysis on Solidity Smart Contracts; Software; 2024-08-05

3. Leading Journals in ChatGPT Articles with More References Co-citations Using Slope Graphs in R with Web R-Platform: Bibliometric Analysis (Preprint); 2024-07-23