Evaluating Chat Generative Pre-trained Transformer Responses to Common Pediatric In-toeing Questions

Author:

Amaral Jason ZarahiORCID,Schultz Rebecca J.ORCID,Martin Benjamin M.ORCID,Taylor TristenORCID,Touban BaselORCID,McGraw-Heinrich JessicaORCID,McKay Scott D.ORCID,Rosenfeld Scott B.ORCID,Smith Brian G.ORCID

Abstract

Objective: Chat generative pre-trained transformer (ChatGPT) has garnered attention in health care for its potential to reshape patient interactions. As patients increasingly rely on artificial intelligence platforms, concerns about information accuracy arise. In-toeing, a common lower extremity variation, often leads to pediatric orthopaedic referrals despite observation being the primary treatment. Our study aims to assess ChatGPT’s responses to pediatric in-toeing questions, contributing to discussions on health care innovation and technology in patient education. Methods: We compiled a list of 34 common in-toeing questions from the “Frequently Asked Questions” sections of 9 health care–affiliated websites, identifying 25 as the most encountered. On January 17, 2024, we queried ChatGPT 3.5 in separate sessions and recorded the responses. These 25 questions were posed again on January 21, 2024, to assess its reproducibility. Two pediatric orthopaedic surgeons evaluated responses using a scale of “excellent (no clarification)” to “unsatisfactory (substantial clarification).” Average ratings were used when evaluators’ grades were within one level of each other. In discordant cases, the senior author provided a decisive rating. Results: We found 46% of ChatGPT responses were “excellent” and 44% “satisfactory (minimal clarification).” In addition, 8% of cases were “satisfactory (moderate clarification)” and 2% were “unsatisfactory.” Questions had appropriate readability, with an average Flesch-Kincaid Grade Level of 4.9 (±2.1). However, ChatGPT’s responses were at a collegiate level, averaging 12.7 (±1.4). No significant differences in ratings were observed between question topics. Furthermore, ChatGPT exhibited moderate consistency after repeated queries, evidenced by a Spearman rho coefficient of 0.55 (P = 0.005). The chatbot appropriately described in-toeing as normal or spontaneously resolving in 62% of responses and consistently recommended evaluation by a health care provider in 100%. Conclusion: The chatbot presented a serviceable, though not perfect, representation of the diagnosis and management of pediatric in-toeing while demonstrating a moderate level of reproducibility in its responses. ChatGPT’s utility could be enhanced by improving readability and consistency and incorporating evidence-based guidelines. Level of Evidence: Level IV—diagnostic.

Publisher

Ovid Technologies (Wolters Kluwer Health)

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3