Assessing Large Language Models’ Proficiency, Clarity, and Objectivity at the Intersection of Obstetrics, Gynecology, and Global Public Health: Cross-Sectional, Comparative Analysis with Specialists' Knowledge on COVID-19 Impacts in Pregnancy (Preprint)

Authors:

Bragazzi Nicola, Buchinger Michèle, Atwan Hisham, Tuma Ruba, Chirico Francesco, Szarpak Lukasz, Farah Raymond, Khamisy-Farah Rola

Abstract

BACKGROUND

The COVID-19 pandemic has significantly strained healthcare systems globally, leading to an overwhelming influx of patients and exacerbating resource limitations. Concurrently, an "infodemic" of misinformation, particularly prevalent in women's health, has emerged. This challenge has been especially acute for healthcare providers, notably gynecologists and obstetricians, who manage pregnant women's health. The pandemic heightened the risks that COVID-19 poses to pregnant women, requiring specialists to weigh vaccine safety against the known risks of infection. Additionally, generative artificial intelligence (AI) tools such as large language models (LLMs) offer promising support in healthcare; however, they require rigorous testing.

OBJECTIVE

To assess LLMs’ proficiency, clarity, and objectivity regarding COVID-19 impacts in pregnancy.

METHODS

This study evaluated four major LLM-based chatbots (ChatGPT-3.5, ChatGPT-4, Microsoft Copilot, and Google Bard) using zero-shot prompts drawn from a questionnaire previously validated among 172 Israeli gynecologists and obstetricians. The questionnaire assesses proficiency in providing accurate information on COVID-19 in relation to pregnancy. Text mining, sentiment analysis (polarity and subjectivity), and readability analysis (Flesch-Kincaid grade level) were also conducted.
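As an illustration only (not the authors' actual pipeline), the following minimal sketch shows how sentiment polarity, subjectivity, and the Flesch-Kincaid grade level can be computed for a chatbot response in Python, assuming the TextBlob and textstat libraries; the sample response text is hypothetical.

from textblob import TextBlob  # sentiment polarity and subjectivity
import textstat                # readability metrics

# Hypothetical LLM response; in the study, each model's answers to the
# validated questionnaire would be analyzed in the same way.
response = (
    "Current evidence indicates that COVID-19 vaccination is recommended "
    "during pregnancy and is not associated with an increased risk of miscarriage."
)

blob = TextBlob(response)
polarity = blob.sentiment.polarity          # -1 (negative) to +1 (positive)
subjectivity = blob.sentiment.subjectivity  # 0 (objective) to 1 (subjective)
fk_grade = textstat.flesch_kincaid_grade(response)  # US school-grade reading level

print(f"Polarity: {polarity:.2f}, Subjectivity: {subjectivity:.2f}, FK grade: {fk_grade:.1f}")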

RESULTS

In terms of LLMs' knowledge, ChatGPT-4 and Microsoft Copilot each scored 96.7%, Google Bard 93.3%, and ChatGPT-3.5 80.0%. Regarding misinformation, ChatGPT-4 incorrectly stated that COVID-19 increases the risk of miscarriage, while Google Bard and Microsoft Copilot contained minor inaccuracies concerning COVID-19 transmission and complications. In the sentiment analysis, polarity scores were moderately positive: ChatGPT-4 at 0.37, followed by Microsoft Copilot at 0.33, ChatGPT-3.5 at 0.25, and Google Bard at 0.23. Subjectivity levels were moderate, with Microsoft Copilot being the most objective (0.42). Finally, in the readability analysis, the Flesch-Kincaid grade level was highest for ChatGPT-3.5 at 25.34, followed by ChatGPT-4 at 21.12, Google Bard at 18.30, and Microsoft Copilot at 11.27.
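For context in interpreting these readability scores, the Flesch-Kincaid grade level is conventionally defined as

\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59

so higher values reflect longer sentences and longer words, that is, text requiring more years of formal education to read comfortably.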

CONCLUSIONS

The study highlights varying levels of knowledge about COVID-19 and pregnancy across LLMs. ChatGPT-3.5 showed the least knowledge and the weakest alignment with scientific evidence. The readability and complexity analyses suggest that each model's output is tailored to a different audience, with the ChatGPT versions being better suited to specialized readers. The sentiment analysis underscores the importance of factual and objective information dissemination. Overall, ChatGPT-4, Microsoft Copilot, and Google Bard generally provide accurate, up-to-date information on COVID-19 and vaccines in women's health that aligns with health guidelines. The study demonstrates the potential role of AI in supplementing healthcare knowledge, along with the need for continuous updating and verification of AI knowledge bases. The choice of AI tool should consider the target audience and the level of detail required.

Publisher

JMIR Publications Inc.
