BACKGROUND
Artificial intelligence (AI) and large language models (LLMs) are emerging as a transformative force across many fields, notably medicine. Their effectiveness in creating physical exercise rehabilitation programs and providing information on musculoskeletal (MSK) disorders has yet to be fully explored.
OBJECTIVE
To assess the quality and readability of an LLM’s responses to consultation questions addressing the phases of the entire clinical process experienced by patients with chronic MSK disorders.
METHODS
This cross-sectional study retrieved frequently asked questions from Google (accessed September 3 to October 24, 2023) and randomly selected 25 adult patients with chronic MSK pain. Three clinical scenario questions were designed to simulate the entire process of a real-world clinical consultation. These questions were used as queries to an LLM, ChatGPT version 4.0 (accessed September 23 to December 24, 2023), to generate responses. The quality of the responses was evaluated by two independent orthopedic clinicians with the DISCERN instrument, and readability was assessed with the WebFX readability tool website (accessed December 14 to December 20, 2023). Statistical analysis was conducted from January to April 2024.
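As an illustration of the readability-assessment step, the sketch below computes the same six standard indices for a single generated response. It is a minimal sketch only: the Python textstat package is assumed here as a stand-in for the WebFX web tool used in the study, and the example text is a placeholder rather than an actual study output.

    # Minimal sketch of the readability-scoring step, assuming the Python
    # "textstat" package as a stand-in for the WebFX web tool used in the study.
    import textstat

    def readability_scores(response_text):
        # Return the six standard indices reported in the RESULTS section.
        return {
            "FRES": textstat.flesch_reading_ease(response_text),    # higher = easier to read
            "FKGL": textstat.flesch_kincaid_grade(response_text),   # US school grade level
            "GF":   textstat.gunning_fog(response_text),
            "SMOG": textstat.smog_index(response_text),
            "CLI":  textstat.coleman_liau_index(response_text),
            "ARI":  textstat.automated_readability_index(response_text),
        }

    # Placeholder response text, not an actual LLM output from the study.
    print(readability_scores("Perform three sets of ten knee extension exercises each day."))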
RESULTS
Across the 98 generated programs, the response format was relatively fixed, following the structure of the queries provided to the LLM. The mean (SD) DISCERN scores assigned by the two clinicians were 52.49 (8.57) and 51.50 (8.64), respectively, with all scores ranging from 32 to 67. Analysis of variance showed no significant difference between the two clinicians (p=0.42) or among the five MSK disorders (p=0.08). The Cohen κ coefficient was 0.73 (95% CI 0.710 to 0.756), signifying good interrater agreement, and the Cronbach α value was 0.834, indicating good internal consistency. The mean (SD) scores across the six readability tools for all programs were as follows: Flesch Reading Ease Score (FRES): 53.04 (16.23); Flesch-Kincaid Grade Level (FKGL): 10.15 (3.42); Gunning Fog (GF): 12.52 (3.41); Simple Measure of Gobbledygook (SMOG): 9.87 (2.75); Coleman-Liau Index (CLI): 13.28 (1.97); Automated Readability Index (ARI): 10.48 (3.80). All readability scores were significantly worse than the recommended reading level (p<0.05).
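For readers unfamiliar with the two agreement statistics, the sketch below reproduces the calculations from the two raters' DISCERN totals. The score vectors are hypothetical placeholders, not study data, and the quadratic weighting for κ is an assumption (the abstract does not state which weighting scheme was used).

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical DISCERN totals for the two raters (placeholders, not study data).
    rater1 = np.array([52, 48, 60, 55, 47, 63, 41, 58])
    rater2 = np.array([50, 49, 58, 56, 45, 62, 43, 57])

    # Cohen's kappa for interrater agreement; quadratic weights are a common
    # choice for ordinal totals (an assumption, not stated in the abstract).
    kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")

    # Cronbach's alpha treating the two raters as k = 2 "items":
    # alpha = k/(k-1) * (1 - sum of item variances / variance of the summed score)
    items = np.vstack([rater1, rater2])
    k = items.shape[0]
    alpha = k / (k - 1) * (1 - items.var(axis=1, ddof=1).sum()
                           / items.sum(axis=0).var(ddof=1))

    print(f"kappa = {kappa:.3f}, alpha = {alpha:.3f}")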
CONCLUSIONS
In this cross-sectional study, the LLM generated high-quality physical exercise prescriptions throughout the entire consultation process of patients with MSK disorders. However, the responses lacked supporting materials, and their readability was not sufficiently patient-friendly. These findings suggest that, with professional physician evaluation and some improvements, LLMs could be widely applied in the field of orthopedics in the future.