ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models

Published: 2023-12-05
Volume: 42, Issue: 6, Pages: 1-14
ISSN: 0730-0301
Journal: ACM Transactions on Graphics (ACM Trans. Graph.)
Language: English

Authors:

Zhang Yuxin¹, Dong Weiming¹, Tang Fan², Huang Nisha³, Huang Haibin⁴, Ma Chongyang⁴, Lee Tong-Yee⁵, Deussen Oliver⁶, Xu Changsheng¹

Affiliations:

1. MAIS, Institute of Automation, CAS, China and School of Artificial Intelligence, UCAS, China

2. Institute of Computing Technology, CAS, China

3. School of Artificial Intelligence, UCAS, China and MAIS, Institute of Automation, CAS, China

4. Kuaishou Technology, China

5. National Cheng-Kung University, Taiwan

6. University of Konstanz, Germany

Abstract

Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes such as material, style, and layout remains a challenge, leading to a lack of disentanglement and editability. To address this problem, we propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low to high frequency information, providing a new perspective on representing, generating, and editing images. We develop the Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer better disentanglement and controllability compared to existing methods. We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout, achieving previously unattainable results from a single image input without fine-tuning the diffusion models. Our source code is available at https://github.com/zyxElsa/ProSpect.
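The abstract describes ProSpect's representation only at a high level: one inverted token embedding per generation stage, where a stage is a group of consecutive denoising steps. The Python sketch below illustrates that bookkeeping under stated assumptions. The 1000-step schedule, the split into 10 equal stages, the placeholder strings such as `<A0>` standing in for learned embeddings, and the helpers `stage_of_timestep`, `prompt_for_timestep`, and `mix_stages` are all hypothetical choices made for illustration, not details taken from the text above. The sketch does not call any diffusion library; a real pipeline would pass the selected embedding to the denoiser's text conditioning at each step.

```python
from typing import List, Sequence

# Assumptions for illustration only (not taken from the text above):
NUM_TRAIN_STEPS = 1000  # DDPM-style schedule with timesteps 999 (noisiest) down to 0
NUM_STAGES = 10         # hypothetical number of generation stages


def stage_of_timestep(t: int, num_steps: int = NUM_TRAIN_STEPS,
                      num_stages: int = NUM_STAGES) -> int:
    """Return the stage index for timestep t.

    Denoising runs from t = num_steps - 1 down to 0, so stage 0 holds the
    noisiest steps (coarse, low-frequency content) and the last stage holds
    the final, cleanest steps (fine, high-frequency detail).
    """
    if not 0 <= t < num_steps:
        raise ValueError(f"timestep {t} outside [0, {num_steps})")
    steps_per_stage = num_steps // num_stages
    # Clamp so any leftover steps fall into the last stage.
    return min((num_steps - 1 - t) // steps_per_stage, num_stages - 1)


def prompt_for_timestep(stage_prompts: Sequence[str], t: int) -> str:
    """Pick the per-stage token (placeholder for a learned embedding) used at step t."""
    return stage_prompts[stage_of_timestep(t, num_stages=len(stage_prompts))]


def mix_stages(early_source: Sequence[str], late_source: Sequence[str],
               split: int) -> List[str]:
    """Take stages [0, split) from one representation and the rest from another.

    Hypothetical helper; assumes early stages mostly govern layout and later
    stages finer appearance, as the low-to-high-frequency ordering suggests.
    """
    if len(early_source) != len(late_source):
        raise ValueError("both representations need the same number of stages")
    if not 0 <= split <= len(early_source):
        raise ValueError("split must lie between 0 and the number of stages")
    return list(early_source[:split]) + list(late_source[split:])


if __name__ == "__main__":
    image_a = [f"<A{i}>" for i in range(NUM_STAGES)]  # stand-ins for image A's tokens
    image_b = [f"<B{i}>" for i in range(NUM_STAGES)]  # stand-ins for image B's tokens
    mixed = mix_stages(image_a, image_b, split=4)
    # A few timesteps, visited from noisiest to cleanest as a sampler would.
    for t in (999, 750, 500, 250, 0):
        print(t, stage_of_timestep(t), prompt_for_timestep(mixed, t))
```

With `split=4`, the noisiest steps (999, 750) draw on image A's tokens and the later steps (500, 250, 0) draw on image B's. This mirrors the kind of attribute-aware recombination the abstract mentions, but where to place the split for a given attribute is an empirical question this sketch does not answer.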

Funders

National Natural Science Foundation of China

Deutsche Forschungsgemeinschaft

Beijing Natural Science Foundation

National Key R&D Program of China

National Science and Technology Council

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Graphics and Computer-Aided Design

Link

https://dl.acm.org/doi/pdf/10.1145/3618342

References: 72 articles

Cited by 11 articles

1. Multi-granularity siamese transformer-based change detection in remote sensing imagery. Engineering Applications of Artificial Intelligence, 2024-10.

2. DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image. ACM Transactions on Graphics, 2024-07-19.

3. Progressive Dynamics for Cloth and Shell Animation. ACM Transactions on Graphics, 2024-07-19.

4. Creating LEGO Figurines from Single Images. ACM Transactions on Graphics, 2024-07-19.

5. Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance. Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers '24, 2024-07-13.