Quality and Comprehensibility of Large Language Model Responses to Common Patient Questions Regarding Musculoskeletal Disorders (Preprint)

Authors:

Fu Yu, Chen Xi, You Mingke, Wang Lingcheng, Wang Li, Liu Weizhi, Zhou Kai, Chen Gang

Abstract

BACKGROUND

Artificial intelligence (AI) and large language models (LLMs) are emerging as a transformative force across many fields, notably medicine. Their effectiveness in creating physical exercise rehabilitation programs and in providing information on musculoskeletal (MSK) disorders has yet to be fully explored.

OBJECTIVE

To assess the quality and readability of an LLM's responses to consultation questions spanning the phases of the clinical process experienced by patients with chronic musculoskeletal disorders.

METHODS

This cross-sectional study retrieved frequently asked questions from Google (accessed September 3 to October 24, 2023) and randomly selected 25 adult patients with chronic musculoskeletal pain. Three clinical scenario questions were designed to simulate the full course of a real-world clinical consultation. These questions were submitted as queries to an AI LLM, ChatGPT version 4.0 (accessed September 23 to December 24, 2023), to generate responses. Response quality was evaluated by two independent orthopedic clinicians using the DISCERN instrument, and readability was assessed with the WebFX online readability tools (accessed December 14 to December 20, 2023). Statistical analysis was conducted from January to April 2024.
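
As an illustrative sketch of how such readability scores are derived (the study itself used the WebFX website, per the abstract), the Flesch Reading Ease Score and Flesch-Kincaid Grade Level follow standard published formulas based on word, sentence, and syllable counts; the vowel-group syllable counter below is a simplifying assumption, not the tool's actual method.

```python
import re

def count_syllables(word: str) -> int:
    """Heuristic syllable count: runs of vowels, minus a silent trailing 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> dict:
    """FRES and FKGL computed from the standard published formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / max(len(sentences), 1)   # words per sentence
    spw = syllables / max(len(words), 1)        # syllables per word
    return {
        "FRES": 206.835 - 1.015 * wps - 84.6 * spw,
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,
    }

print(readability("Apply ice to the knee for twenty minutes. Rest and elevate the leg."))
```

Higher FRES indicates easier text (60-70 is roughly plain English), while FKGL maps directly to a US school grade level, which is why the two scores in the Results move in opposite directions.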

RESULTS

Across the 98 generated responses, the format was relatively fixed, following the structure of the queries provided to the LLM. The mean (SD) DISCERN scores assigned by the two physicians were 52.49 (8.57) and 51.50 (8.64), respectively, with all scores ranging from 32 to 67. Analysis of variance showed no significant difference between the two physicians (p=0.42) or among the five musculoskeletal disorders (p=0.08). The Cohen κ coefficient was 0.73 (95% CI 0.710 to 0.756), indicating good interrater agreement, and Cronbach's α was 0.834, indicating good reliability. The mean (SD) scores across the six readability tools were: Flesch Reading Ease Score (FRES) 53.04 (16.23), Flesch-Kincaid Grade Level (FKGL) 10.15 (3.42), Gunning Fog (GF) 12.52 (3.41), Simple Measure of Gobbledygook (SMOG) 9.87 (2.75), Coleman-Liau Index (CLI) 13.28 (1.97), and Automated Readability Index (ARI) 10.48 (3.80). All readability scores were significantly worse than the recommended reading level (p<0.05).
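
For readers who want to see how the agreement statistics above are computed, the sketch below calculates an unweighted Cohen's κ and Cronbach's α for two raters. The abstract does not specify the κ variant or the rating granularity, so the item-level 1-to-5 ratings here are hypothetical.

```python
import numpy as np

def cohen_kappa(r1, r2, categories):
    """Unweighted Cohen's kappa: observed vs. chance agreement
    between two raters over categorical ratings."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    p_o = np.mean(r1 == r2)                                    # observed agreement
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_o - p_e) / (1 - p_e)

def cronbach_alpha(ratings):
    """Cronbach's alpha for a (subjects x raters) score matrix."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                                       # number of raters
    item_vars = ratings.var(axis=0, ddof=1).sum()              # sum of per-rater variances
    total_var = ratings.sum(axis=1).var(ddof=1)                # variance of summed scores
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical DISCERN item ratings (1-5) from two raters:
r1 = [3, 4, 4, 5, 2, 3, 4, 4, 5, 3]
r2 = [3, 4, 5, 5, 2, 3, 4, 3, 5, 3]
print(cohen_kappa(r1, r2, categories=range(1, 6)))
print(cronbach_alpha(np.column_stack([r1, r2])))
```

A κ of 0.73 falls in the conventionally "substantial" agreement band (0.61-0.80), and α above 0.8 is typically read as good internal consistency, consistent with the interpretation in the Results.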

CONCLUSIONS

In this cross-sectional study, the LLM generated high-quality physical exercise prescriptions across the entire consultation process for patients with musculoskeletal disorders. However, the responses lacked supporting materials, and their readability was not sufficiently accessible to the general public. These findings suggest that, with professional physician evaluation and some improvements, LLMs could potentially see wide application in the field of orthopedics.

Publisher

JMIR Publications Inc.
