Authors:
Simon Hentschel, Konstantin Kobs, Andreas Hotho
Abstract
Most Image Aesthetic Assessment (IAA) methods use a pretrained ImageNet classification model as a base to fine-tune. We hypothesize that content classification is not an optimal pretraining task for IAA, since it discourages the extraction of features that are useful for IAA, e.g., composition, lighting, or style. Instead, we argue that the Contrastive Language-Image Pretraining (CLIP) model is a better base for IAA models, since it has been trained with natural language supervision. Due to the rich nature of language, CLIP needs to learn a broad range of image features that correlate with sentences describing the image content, composition, environment, and even subjective feelings about the image. While CLIP has been shown to extract features useful for content classification tasks, its suitability for tasks that require style-based features, such as IAA, has not yet been demonstrated. We test our hypothesis in a three-step study, investigating the usefulness of features extracted by CLIP compared to features obtained from the last layer of a comparable ImageNet classification model; each step is more computationally expensive than the previous one. First, we engineer natural language prompts that let CLIP assess an image's aesthetics without adjusting any of its weights. To overcome the challenge that prompting CLIP is only applicable to classification tasks, we propose a simple but effective strategy to convert multiple prompts into a continuous scalar, as required when predicting an image's mean aesthetic score. Second, we train a linear regression on the AVA dataset using image features obtained from CLIP's image encoder. The resulting model outperforms a linear regression trained on features from an ImageNet classification model and shows competitive performance with fully fine-tuned ImageNet-based networks, while training only a single layer. Finally, by fine-tuning CLIP's image encoder on the AVA dataset, we show that CLIP needs only a fraction of the training epochs to converge, while also performing better than a fine-tuned ImageNet model. Overall, our experiments suggest that CLIP is better suited as a base model for IAA methods than ImageNet pretrained networks.
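The prompt-to-scalar conversion in the first step can be illustrated with a short sketch. The snippet below is a minimal example assuming the Hugging Face transformers CLIP implementation; the prompt texts, the anchor scores mapped to them, and the probability-weighted average are placeholder assumptions consistent with the abstract's description, not the engineered prompts or exact conversion from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder prompts and anchor scores (AVA uses a 1-10 rating scale);
# the actual engineered prompts and score mapping are defined in the paper.
PROMPTS = ["a bad photo", "an average photo", "a good photo", "an excellent photo"]
ANCHOR_SCORES = torch.tensor([2.5, 5.0, 7.5, 9.0])

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def predict_mean_score(image: Image.Image) -> float:
    """Softmax CLIP's image-prompt similarities, then take a
    probability-weighted average of the anchor scores."""
    inputs = processor(text=PROMPTS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_prompts)
    probs = logits.softmax(dim=-1).squeeze(0)
    return float((probs * ANCHOR_SCORES).sum())
```

The second step, a linear regression on frozen CLIP image features, could look roughly like the following sketch; the helper name fit_linear_probe and the use of scikit-learn's LinearRegression are illustrative choices, and the images and mean scores are expected to come from the AVA training split.

```python
import torch
from sklearn.linear_model import LinearRegression
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def fit_linear_probe(images, mean_scores) -> LinearRegression:
    """Train only a single linear layer on top of frozen CLIP embeddings.
    `images` is a list of PIL images, `mean_scores` their mean aesthetic
    ratings (e.g., from the AVA training split)."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)  # CLIP encoder stays frozen
    return LinearRegression().fit(feats.numpy(), mean_scores)
```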