BACKGROUND
Effective dermatological practices are essential for maintaining healthy skin and addressing health concerns such as acne. Traditional dermatological consultations are often limited by accessibility, cost, and variability in expertise. The application of Large Language Models (LLMs) to healthcare, and to dermatology in particular, is an area of growing interest.
OBJECTIVE
This study evaluates the feasibility of using mainstream LLMs, namely GPT-3.5-turbo, GPT-4, and GPT-4o, as consultants for acne-related health concerns. The primary objective is to determine whether these models can deliver accurate and relevant answers that are competitive with professional dermatological standards. The study also compares the models to identify which offers the best overall performance for acne management consultations.
METHODS
Questions posed by real people were sourced from major social media platforms, health forums, and dermatology clinics. Personally Identifiable Information (PII) filtering was applied when selecting 37 acne-related questions to ensure privacy compliance. Each LLM generated three responses per question. An automated evaluation system using GPT-4o assessed the responses against ten criteria: accuracy of terminology, evidence support, factual correctness, completeness, ethical considerations, practicality, safety advice, tone, personalization, and up-to-date information. Each response was categorized as Pass, Fail, or Ignore on each criterion.
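The aggregation step behind the reported pass rates can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the function names, the record layout, and in particular the `count_ignored` switch (whether Ignore verdicts stay in the denominator) are assumptions, since the abstract does not say how Ignore was handled.

```python
from collections import Counter, defaultdict

# Figures taken from the abstract; everything else below is hypothetical.
N_QUESTIONS = 37
MODELS = ["gpt-3.5-turbo", "gpt-4", "gpt-4o"]
N_RUNS = 3


def expected_response_count(n_questions=N_QUESTIONS,
                            n_models=len(MODELS),
                            n_runs=N_RUNS):
    """Number of responses if every model answers every question n_runs times."""
    return n_questions * n_models * n_runs


def pass_rate(verdicts, count_ignored=True):
    """Percentage of 'Pass' verdicts in a list of 'Pass'/'Fail'/'Ignore' labels.

    count_ignored is an assumption: True keeps 'Ignore' in the denominator,
    False computes the rate over 'Pass' and 'Fail' only.
    """
    counts = Counter(verdicts)
    denominator = len(verdicts) if count_ignored else counts["Pass"] + counts["Fail"]
    if denominator == 0:
        return float("nan")
    return 100.0 * counts["Pass"] / denominator


def pass_rates_by_criterion(records, count_ignored=True):
    """Group (criterion, verdict) pairs and compute one pass rate per criterion."""
    grouped = defaultdict(list)
    for criterion, verdict in records:
        grouped[criterion].append(verdict)
    return {c: pass_rate(v, count_ignored) for c, v in grouped.items()}
```

Under these assumptions, 37 questions, three models, and three runs give 333 responses overall and 111 per model. Filtering the records by model before calling `pass_rates_by_criterion` would yield the per-model comparison.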
RESULTS
Across GPT-3.5-turbo, GPT-4, and GPT-4o, pass rates were high on the accuracy-related criteria: correct terminology (95.20%), evidence support (99.10%), and factual correctness (96.10%). Pass rates were lower for personalization (24.62%), safety advice (76.88%), and up-to-date information (75.98%). A comparative analysis showed that GPT-4 generally outperformed GPT-3.5-turbo and GPT-4o on most criteria, achieving higher pass rates for completeness (91.89%) and ethical considerations (97.30%), and it excelled in tailoring recommendations to individual profiles. In contrast, GPT-4o demonstrated the highest accuracy.
CONCLUSIONS
All models demonstrated strong performance in accuracy, ethical considerations, and polite, respectful tone. However, the models showed limitations in personalization and safety advice. Overall, GPT models at their current stage are capable of supporting initial dermatological consultations and of assisting patients as a self-explanatory tool, but ongoing enhancements are needed to address these shortcomings and make their performance more reliable and effective.