BACKGROUND
Online sources of medical information, such as ChatGPT, play a growing role in patients' health decisions. However, many patients have limited health literacy, so such content should be written at a level the average adult can read easily. To date, no research has examined how readably ChatGPT delivers medical information in text form.
OBJECTIVE
To assess the readability and presentation suitability of ChatGPT responses to the most commonly asked patient questions, and to evaluate ChatGPT's ability to improve the readability of its responses when instructed to do so.
METHODS
This study comprised two phases. First, on March 20, 2023, we evaluated ChatGPT's responses to 30 knee osteoarthritis (OA)-related questions for readability and presentation suitability. We applied the Flesch-Kincaid Grade Level (FKGL) and Simple Measure of Gobbledygook (SMOG) readability formulas, together with three evaluation tools: the Suitability Assessment of Materials (SAM) for presentation scores, and the Ensuring Quality Information for Patients (EQIP) tool and the modified DISCERN (mDISCERN) for overall quality scores. Second, we assessed readability improvement in answers to 50 stroke-related questions after providing ChatGPT with both detailed and simple instructions, again applying the FKGL and SMOG readability tests.
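For reference, the standard published forms of these two readability indices (not restated in the study itself) are:

$$\text{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$

$$\text{SMOG} = 1.0430\sqrt{\text{polysyllable count} \times \frac{30}{\text{sentence count}}} + 3.1291$$

Both indices estimate a US school reading grade; the commonly recommended target for patient-facing materials is roughly the sixth grade.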
RESULTS
In the readability assessment, the mean (standard deviation, SD) scores for the 30 responses regarding knee OA were as follows: FKGL, 13.65 (1.80) reading grade and SMOG, 15.62 (1.55) reading grade, both significantly higher than the recommended sixth-grade reading level (P < 0.001). In the presentation suitability assessment, the mean SAM score across all answers was 55.00 (10.64), which is rated “adequate.” The mean EQIP and mDISCERN scores were 43.72 (5.78) and 2.83 (0.59), respectively, and none of the responses was rated high quality. After applying the detailed and simple instructions to the 50 responses regarding stroke, analysis of variance (ANOVA) showed statistically significant differences in mean readability scores among the three groups: pre-intervention, post-intervention with detailed instructions, and post-intervention with simple instructions (P < 0.001). Post-hoc analysis revealed that the pre-intervention group differed significantly from both post-intervention groups on both readability measures (P < 0.001 for all comparisons). However, there was no significant difference between the two post-intervention groups (P = 0.96 for FKGL and P = 0.86 for SMOG).
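As a minimal sketch of the three-group comparison described above (the abstract does not specify the post-hoc procedure; the Bonferroni-corrected pairwise t-tests and the score arrays below are illustrative assumptions, not the study's data):

```python
# Sketch of a three-group readability comparison (ANOVA + post-hoc).
# Score arrays are placeholders; Bonferroni-corrected pairwise t-tests
# stand in for the unspecified post-hoc procedure.
from itertools import combinations
from scipy.stats import f_oneway, ttest_ind

groups = {
    "pre": [13.2, 14.1, 12.8, 13.9],        # pre-intervention FKGL (placeholder)
    "detailed": [10.4, 11.0, 10.7, 10.9],   # post-intervention, detailed instructions
    "simple": [10.6, 10.9, 11.1, 10.5],     # post-intervention, simple instructions
}

# One-way ANOVA across the three groups
f_stat, p_value = f_oneway(*groups.values())
print(f"ANOVA: F = {f_stat:.2f}, P = {p_value:.4f}")

# Post-hoc pairwise comparisons, Bonferroni-corrected for 3 comparisons
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    _, p = ttest_ind(a, b)
    print(f"{name_a} vs {name_b}: corrected P = {min(p * 3, 1.0):.4f}")
```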
CONCLUSIONS
This study found that, despite presenting medical information in an adequate format, ChatGPT's responses are difficult to read and of low quality, which may cause difficulty for patients. Furthermore, ChatGPT lacked the ability to improve the readability of medical information. As the technology advances, enhancing the readability and user-friendliness of ChatGPT's responses will increase its usefulness for patients.
CLINICALTRIAL
Not applicable.