Evaluation Metrics for Generative Models: An Empirical Study

Published: 2024-07-07 Issue: 3 Volume: 6 Page: 1531-1544
ISSN: 2504-4990
Container-title: Machine Learning and Knowledge Extraction
Language: en
Short-container-title: MAKE

Author:

Betzalel Eyal¹, Penso Coby¹, Fetaya Ethan¹

Affiliation:

1. Faculty of Electrical and Computer Engineering, Bar-Ilan University, Ramat-Gan 5290002, Israel

Abstract

Generative models such as generative adversarial networks, diffusion models, and variational auto-encoders have become prevalent in recent years. Although these models have shown remarkable results, evaluating their performance is challenging. Reliable evaluation is vital for pushing research forward and for distinguishing meaningful gains from random noise. Currently, heuristic metrics such as the inception score (IS) and Fréchet inception distance (FID) are the most common evaluation metrics, but what they measure is not entirely clear, nor is it clear how meaningful their scores actually are. In this work, we propose a novel evaluation protocol for likelihood-based generative models, based on generating a high-quality synthetic dataset on which we can estimate classical metrics for comparison. Because the underlying likelihood values of the data are known, this scheme can measure the divergence between the model-generated data and the synthetic dataset. Our study shows that while FID and IS correlate with several f-divergences, their ranking of close models can vary considerably, making them problematic when used for fine-grained comparison. We further use this experimental setting to study which evaluation metric best correlates with our probabilistic metrics.
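As a rough illustration of why known likelihood values help, the sketch below estimates a KL divergence by Monte Carlo when both distributions have tractable log-densities. It is a toy example and not the authors' protocol: the two 1-D Gaussians, the parameter values, and the helper names (`gaussian_logpdf`, `monte_carlo_kl`, `closed_form_kl`) are all assumptions chosen so the estimate can be checked against a closed form.

```python
import math
import random

def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian; stands in for a model's exact log-likelihood."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def monte_carlo_kl(samples, log_p, log_q):
    """Estimate KL(p || q) = E_{x~p}[log p(x) - log q(x)] from samples drawn from p."""
    return sum(log_p(x) - log_q(x) for x in samples) / len(samples)

def closed_form_kl(mean_p, std_p, mean_q, std_q):
    """Exact KL between two 1-D Gaussians, used only to sanity-check the estimate."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
            - 0.5)

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical distributions: p plays the role of the data we sample from,
# q the role of the distribution it is compared against.
mean_p, std_p = 0.0, 1.0
mean_q, std_q = 0.5, 1.2

samples = [rng.gauss(mean_p, std_p) for _ in range(20000)]
kl_estimate = monte_carlo_kl(
    samples,
    lambda x: gaussian_logpdf(x, mean_p, std_p),
    lambda x: gaussian_logpdf(x, mean_q, std_q),
)
kl_exact = closed_form_kl(mean_p, std_p, mean_q, std_q)
print(f"estimate={kl_estimate:.4f} exact={kl_exact:.4f}")
```

With exact log-likelihoods on both sides, the estimate needs no feature extractor or density fit, which is the property a likelihood-based comparison relies on. Other f-divergences can be estimated the same way by swapping the averaged function of the log-ratio.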

Funder

Israeli Council for Higher Education, Data Science Program

Publisher

MDPI AG

Link

https://www.mdpi.com/2504-4990/6/3/73/pdf
