BACKGROUND
The growing interest in AI tools such as large language models in medicine, and in nutrition in particular, underscores the importance of evaluating their efficacy across languages. While large language models such as ChatGPT-4 have shown competency in English, their performance in underrepresented languages such as Kazakh and Russian remains to be investigated. Given the scarcity of non-English training data, it is critical to examine the ability of ChatGPT-4 to provide specific nutritional recommendations in different languages.
OBJECTIVE
The objective of this research is to evaluate how well the ChatGPT-4 system can provide personalized, evidence-based, and practical nutritional advice in English, Kazakh, and Russian.
METHODS
This study was conducted from May 15 to August 31, 2023. Fifty mock patient case studies were input into ChatGPT-4, which generated nutritional recommendations and diet plans. For the underrepresented languages (Russian and Kazakh), the quality of the generated outputs was enhanced through intermediate translation steps using the Google Translate API. All responses were evaluated for personalization, consistency, and practicality on a 5-point Likert scale. The Kruskal-Wallis test was used to identify significant differences among the three languages, and pairwise comparisons between languages were carried out using Dunn's post hoc test.
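As an illustration of the omnibus comparison described above, the following is a minimal Python sketch of a Kruskal-Wallis test for three groups, written with the standard library only. It is not the study's analysis code: the function names and the example scores are hypothetical. With three groups the test has 2 degrees of freedom, so the chi-square tail probability reduces to exp(-H/2). The pairwise Dunn's comparisons are not covered here.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_three_groups(a, b, c):
    """Tie-corrected Kruskal-Wallis H and p-value for exactly three groups."""
    groups = [list(a), list(b), list(c)]
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank sum squared / group size) over the three groups.
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h /= correction

    # Chi-square survival function with df = 2 is exp(-h / 2).
    p_value = math.exp(-h / 2)
    return h, p_value


# Hypothetical Likert scores (1-5), for illustration only.
english = [3, 4, 3, 4, 3]
russian = [3, 3, 4, 3, 3]
kazakh = [1, 1, 2, 1, 1]
h_stat, p = kruskal_wallis_three_groups(english, russian, kazakh)
print(f"H = {h_stat:.3f}, p = {p:.4f}")
```

Likert scores are ordinal and heavily tied, which is why the sketch ranks the pooled observations with average ranks and applies the tie correction rather than comparing means.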
RESULTS
Significant differences were observed among the scores for the outputs generated in the three languages (p < 0.0001). While the performance of the ChatGPT-4 system was moderate across all categories for both English and Russian, the Kazakh outputs were unsuitable for evaluation. For English outputs, the average scores were 3.32 ± 0.46 for personalization, 3.48 ± 0.43 for consistency, and 3.25 ± 0.41 for practicality and availability. For Russian, the average scores were slightly lower for personalization (3.18 ± 0.38) and consistency (3.38 ± 0.39) but slightly higher for practicality and availability (3.37 ± 0.38). For the Kazakh language, all categories scored just above 1. However, after the machine translation step, the nutritional recommendations in Kazakh improved, and there were no longer significant differences among the outputs in the three languages.
CONCLUSIONS
These observations reveal that, even when the same prompts are used in three different languages, the ability of the ChatGPT-4 system to generate coherent responses is limited by insufficient training data in non-English languages. The findings suggest that including non-English training datasets can be valuable for optimizing the performance of large language models. Moreover, this study underscores the potential of automated machine translation as a means of overcoming the existing constraints of the ChatGPT-4 system in providing dietary guidance to non-English-speaking populations.
CLINICALTRIAL
Not applicable