Abstract
Background
Generative AI models that can produce photorealistic images from text descriptions have many applications in medicine, including medical education and synthetic data generation. However, evaluating and comparing their heterogeneous outputs is challenging, and there is therefore a need for a systematic approach that enables image and model comparisons.

Methods
We develop an error classification system for annotating errors in AI-generated photorealistic images of humans and apply our method to a corpus of 240 images generated with three different models (DALL-E 3, Stable Diffusion XL and Stable Cascade) using 10 prompts with 8 images per prompt per model. The error classification system identifies five error types with three severities across five anatomical regions and specifies an associated quantitative scoring method based on the aggregated proportions of errors per expected count of anatomical components for the generated image. We assess inter-rater agreement by double-annotating 25% of the images and calculating Krippendorff's alpha, and we compare results across the three models and ten prompts quantitatively using a cumulative score per image.

Findings
The error classification system, accompanying training manual, generated image collection, annotations, and all associated scripts are available from our GitHub repository at https://github.com/hastingslab-org/ai-human-images. Inter-rater agreement was relatively poor, reflecting the subjectivity of the error classification task. Model comparisons revealed that DALL-E 3 performed consistently better than the Stable Diffusion models; however, the latter generated images reflecting greater diversity in personal attributes. Images with groups of people were more challenging for all the models than individuals or pairs, and some prompts were challenging for all models.

Interpretation
Our method enables systematic comparison of AI-generated photorealistic images of humans; our results can serve to catalyse improvements in these models for medical applications.

Funding
This study received support from the University of Zurich's Digital Society Initiative and the Swiss National Science Foundation under grant agreement 209510.

Research in context

Evidence before this study
The authors searched PubMed and Google Scholar for publications evaluating text-to-image model outputs for medical applications between 2014 (when generative adversarial networks first became available) and 2024. While the bulk of evaluations focused on task-specific networks generating single types of medical image, a few evaluations have explored the novel general-purpose text-to-image diffusion models more broadly for applications in medical education and synthetic data generation. However, no previous work has attempted to develop a systematic approach to evaluate these models' representations of human anatomy.

Added value of this study
We present an anatomical error classification system, the first systematic approach to evaluating AI-generated images of humans that enables model and prompt comparisons. We apply our method to a corpus of generated images to compare the state-of-the-art large-scale model DALL-E 3 and two models from the Stable Diffusion family.

Implications of all the available evidence
While our approach enables systematic comparisons, it remains limited by subjectivity and is labour-intensive for images with many represented figures. Future research should explore automating aspects of the evaluation through coupled segmentation and classification models.
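To illustrate the kind of scoring described in the Methods, the sketch below computes a cumulative error score for one image by weighting annotated errors by severity and normalising by the expected count of anatomical components. The severity weights, error-type names, and aggregation formula are illustrative assumptions only; the exact scoring procedure is defined in the scripts in the GitHub repository linked above.

```python
from dataclasses import dataclass

# Hypothetical severity weights; the paper's actual weighting scheme
# is specified in the repository scripts, not reproduced here.
SEVERITY_WEIGHTS = {"minor": 1, "moderate": 2, "severe": 3}

@dataclass
class ErrorAnnotation:
    region: str      # one of the five anatomical regions, e.g. "hands"
    error_type: str  # one of the five error types (names here are illustrative)
    severity: str    # "minor", "moderate", or "severe"

def cumulative_image_score(errors: list[ErrorAnnotation],
                           expected_components: int) -> float:
    """Aggregate severity-weighted errors, normalised by the number of
    anatomical components expected in the image (assumed formula)."""
    if expected_components <= 0:
        raise ValueError("expected_components must be positive")
    total = sum(SEVERITY_WEIGHTS[e.severity] for e in errors)
    return total / expected_components

# Example: an image of a pair of people with three annotated errors and
# 46 expected anatomical components (both figures are illustrative).
errors = [
    ErrorAnnotation("hands", "missing_part", "severe"),
    ErrorAnnotation("face", "distortion", "minor"),
    ErrorAnnotation("legs", "extra_part", "moderate"),
]
print(cumulative_image_score(errors, expected_components=46))  # ~0.130
```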