BACKGROUND
Few feasibility assessments have examined the use of large language models (LLMs) to answer questions from autistic patients in a Chinese-language context. Although Chinese is one of the most widely spoken languages globally, research on the application of these models in medicine has focused predominantly on English-speaking populations.
OBJECTIVE
To assess the effectiveness of LLM chatbots, specifically ChatGPT and ERNIE Bot, in answering questions from individuals with autism in a Chinese-language setting.
METHODS
A total of 100 patient consultation samples, comprising 239 questions, were randomly selected from publicly available autism-related records on DXY spanning January 2018 to August 2023. To maintain objectivity, the original questions and responses were anonymized and presented in randomized order. An evaluation team of three chief physicians assessed the responses across four dimensions: relevance, accuracy, usefulness, and empathy, yielding 717 evaluations in total. For each question, the team first identified the best response and then rated each response on a 5-point Likert scale, with each point representing a distinct level of quality. Finally, the responses from the three sources were compared.
RESULTS
Across the 717 evaluations, assessors preferred physicians' responses in 46.86% (95% CI, 43.21%–50.51%) of cases, ChatGPT's responses in 34.87% (95% CI, 31.38%–38.36%), and ERNIE Bot's responses in 18.27% (95% CI, 15.44%–21.10%). The mean relevance scores for physicians, ChatGPT, and ERNIE Bot were 3.75 (95% CI, 3.69–3.82), 3.69 (95% CI, 3.63–3.74), and 3.41 (95% CI, 3.35–3.46), respectively. For accuracy, physicians (3.66; 95% CI, 3.60–3.73) and ChatGPT (3.73; 95% CI, 3.69–3.77) outperformed ERNIE Bot (3.52; 95% CI, 3.47–3.57). For usefulness, physicians (3.54; 95% CI, 3.47–3.62) were rated higher than ChatGPT (3.40; 95% CI, 3.34–3.47) and ERNIE Bot (3.05; 95% CI, 2.99–3.12). For empathy, ChatGPT (3.64; 95% CI, 3.57–3.71) outperformed both physicians (3.13; 95% CI, 3.04–3.21) and ERNIE Bot (3.11; 95% CI, 3.04–3.18).
CONCLUSIONS
In this cross-sectional study, physicians' responses were superior overall in the Chinese-language context. Nonetheless, LLMs can provide valuable medical guidance to patients with autism and may even surpass physicians in demonstrating empathy. However, further optimization and research are needed before LLMs can be effectively integrated into clinical settings across diverse linguistic environments.
CLINICALTRIAL
The study was registered in the Chinese Clinical Trial Registry (ChiCTR2300074655).