Abstract
Model organisms such as Drosophila melanogaster are extremely well suited to large-scale screens, which often require the assessment of phenotypes in a target tissue (e.g., the wing or eye). Currently, the annotation of defects is either performed manually, which hinders throughput and reproducibility, or relies on dedicated image analysis pipelines, which are tailored to detect only specific defects. Here, we assess the potential of Vision Language Models (VLMs) to automatically detect aberrant phenotypes in a dataset of Drosophila wings and to generate descriptions of them. We compare the performance of one of the most advanced current multimodal models (GPT-4) with that of an open-source alternative (LLaVA). Through a thorough quantitative evaluation, we find strong performance in identifying aberrant wing phenotypes when the VLMs are provided with just a single reference image. GPT-4 performed best at generating textual descriptions and was able to correctly describe complex wing phenotypes. We also provide practical advice on prompting strategies and highlight current limitations of these tools, especially regarding misclassification and the generation of false information, which should be carefully considered if these tools are used as part of an image analysis pipeline.
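The one-shot setup described above (a single reference image shown alongside the query image) can be sketched with the OpenAI Python client. This is a minimal illustration, not the exact pipeline from the paper: the model name, prompt wording, and file names below are assumptions.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    """Base64-encode an image so it can be sent inline to the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Hypothetical file names; any wild-type reference and query wing image work.
reference_b64 = encode_image("wildtype_wing.png")
query_b64 = encode_image("query_wing.png")

response = client.chat.completions.create(
    model="gpt-4o",  # assumption; the paper evaluated a GPT-4 vision model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("The first image shows a normal Drosophila wing. "
                          "Does the second image show an aberrant phenotype? "
                          "If so, describe the defect.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{reference_b64}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{query_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Including the reference image in the same message gives the model an in-context example of the normal phenotype, which is the one-shot prompting strategy the abstract refers to.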
Publisher
Cold Spring Harbor Laboratory