Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model

Author:

Zheng Gu (1,2), Shiyuan Yang (3,2), Jing Liao (2), Jing Huo (1), Yang Gao (1)

Affiliation:

1. Nanjing University, Nanjing, China

2. City University of Hong Kong, Hong Kong SAR, China

3. Tianjin University, Tianjin, China

Abstract

Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category solely rely on textual prompts, which fail to capture fine-grained contextual information from given examples and can be time-consuming when converting from images to text prompts. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively. Our project webpage is available at https://analogist2d.github.io.
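The abstract describes two attention-level operations: self-attention cloning (SAC), which transplants the structural relation between the example pair onto the query, and cross-attention masking (CAM), which confines the influence of the GPT-4V-generated text prompt to the region being inpainted. The following is a minimal sketch of what such operations could look like inside a diffusion U-Net's attention layers, assuming a PyTorch setup where the example pair (A, A'), the query B, and the inpainted target B' occupy four equal-sized cells of one grid image. The function names, index arguments, and tensor shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch

def self_attention_cloning(attn: torch.Tensor,
                           idx_A: torch.Tensor, idx_Ap: torch.Tensor,
                           idx_B: torch.Tensor, idx_Bp: torch.Tensor) -> torch.Tensor:
    """SAC (assumed form): copy the A -> A' self-attention pattern onto B -> B'.

    attn:  (heads, tokens, tokens) self-attention map over the full grid image.
    idx_*: flattened spatial token indices of each grid cell; all four cells
           are assumed to contain the same number of tokens.
    """
    cloned = attn.clone()
    # How A attends to A' encodes the example transformation; transplanting
    # that pattern makes B attend to the inpainted cell B' in the same
    # structural way, guiding a fine-grained structural-level analogy.
    cloned[:, idx_B[:, None], idx_Bp[None, :]] = attn[:, idx_A[:, None], idx_Ap[None, :]]
    return cloned

def cross_attention_masking(attn: torch.Tensor,
                            mask_Bp: torch.Tensor) -> torch.Tensor:
    """CAM (assumed form): restrict text-to-image cross-attention to B'.

    attn:    (heads, image_tokens, text_tokens) cross-attention map.
    mask_Bp: (image_tokens,) boolean, True for tokens inside the B' cell,
             so the text prompt only steers the region being generated.
    """
    return attn * mask_Bp.to(attn.dtype)[None, :, None]
```

In practice such edits would be applied inside the U-Net's attention processors at each denoising step; renormalizing the attention weights after masking is one reasonable variant of the sketch above.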

Funder

Hong Kong RGC General Research Fund

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

