AI Chatbots’ Medical Hallucination: Innovation of References Hallucination Score and Comparison of Six Large Language Models (Preprint)

Authors:

Aljamaan Fadi, Temsah Mohamad-Hani, Tamimi Ibraheem, Al-Eyadhy Ayman, Jamal Amr, Alhasan Khalid, Mesallam Tamer A., Farahat Mohamed, Malki Khalid H.

Abstract

BACKGROUND

Artificial intelligence (AI) chatbots have recently been adopted by healthcare practitioners in medical practice. However, their output has been found to contain varying degrees of hallucination in both content and references. Such hallucinations cast doubt on the reliability of their output and hinder their implementation.

OBJECTIVE

We propose a reference hallucination score (RHS) to evaluate the authenticity of AI chatbots' citations.

METHODS

Six AI chatbots were challenged with the same ten medical prompts, each requesting ten references. The RHS combines six bibliographic items with the reference's relevance to the prompt's keywords. An RHS was calculated for each reference, each prompt, and each prompt type (basic versus complex), and the average RHS was then computed for each AI chatbot and compared across prompt types and chatbots.
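The scoring scheme described above can be sketched in code. This is a hypothetical illustration only: the abstract states that the RHS combines six bibliographic items plus keyword relevance, but the specific item names, the one-point-per-hallucinated-item weighting, and the function names below are assumptions, not the authors' published formula.

```python
# Hypothetical sketch of the Reference Hallucination Score (RHS).
# Assumption: each hallucinated item contributes 1 point, summed over
# six bibliographic items plus keyword relevance (7 checks per reference).

# The six bibliographic items are assumed; the abstract does not list them.
BIBLIOGRAPHIC_ITEMS = [
    "authors", "title", "journal", "year", "volume_pages", "doi",
]

ALL_ITEMS = BIBLIOGRAPHIC_ITEMS + ["keyword_relevance"]


def reference_rhs(checks: dict) -> int:
    """Return the RHS for a single reference.

    `checks` maps each item to True when that item is hallucinated,
    i.e. it does not match a verifiable real publication.
    """
    return sum(1 for item in ALL_ITEMS if checks.get(item, False))


def average_rhs(references: list) -> float:
    """Average RHS across all references produced by one chatbot."""
    return sum(reference_rhs(r) for r in references) / len(references)
```

Under this assumed weighting, a reference with a fabricated title and DOI would score 2, and a fully authentic reference would score 0; averaging over all references then yields a per-chatbot score for comparison.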

RESULTS

Bard failed to generate any references. ChatGPT 3.5 and Bing produced the highest RHS (11), Elicit and SciSpace the lowest, and Perplexity fell in between. The highest degree of hallucination was observed in reference relevance to the prompt keywords (61.6%), and the lowest in reference titles (33.8%). AI chatbots generally scored a significantly higher RHS when given scenario-based or complex-format prompts.

CONCLUSIONS

The variation in RHS underscores the need for a robust reference evaluation tool to improve the authenticity of AI chatbots' citations, and it highlights the importance of verifying their output and references. Elicit and SciSpace showed negligible hallucination, while ChatGPT and Bing reached critical levels. The proposed RHS could contribute to ongoing efforts to enhance the general reliability of AI in medical research.

Publisher

JMIR Publications Inc.
