EmoStyle: Emotion-Aware Semantic Image Manipulation with Audio Guidance
Published: 2024-04-10
Issue: 8
Volume: 14
Page: 3193
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences
Authors:
Shen Qiwei (1), Xu Junjie (2), Mei Jiahao (2), Wu Xingjiao (3), Dong Daoguo (2)
Affiliations:
1. Software Engineering Institute, East China Normal University, Shanghai 200062, China
2. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
3. School of Computer Science, Fudan University, Shanghai 200433, China
Abstract
With the flourishing development of generative models, image manipulation is receiving increasing attention. Beyond the text modality, several elegant designs have explored leveraging audio to manipulate images. However, existing methodologies mainly focus on image generation conditioned on semantic alignment, ignoring the vivid affective information conveyed by the audio. We propose an Emotion-aware StyleGAN Manipulator (EmoStyle), a framework in which affective information from audio is explicitly extracted and further utilized during image manipulation. Specifically, we first leverage the multi-modality model ImageBind for initial cross-modal retrieval between images and music, and select the music-related image for further manipulation. Simultaneously, by extracting sentiment polarity from the lyrics of the audio, we generate an emotionally rich auxiliary music branch to accentuate the affective information. We then leverage pre-trained encoders to encode the audio and the audio-related image into the same embedding space. With the aligned embeddings, we manipulate the image via a direct latent optimization method. We conduct objective and subjective evaluations on the generated images, and the results show that our framework can generate images that reflect the human emotions conveyed in the audio.
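The abstract outlines a two-stage pipeline: cross-modal retrieval of a music-related image using joint audio/image embeddings, followed by direct latent optimization of a StyleGAN code so that the encoded result moves toward the audio embedding. The sketch below is a minimal illustration of those two steps only; the stand-in generator and encoders, the cosine-alignment loss, the L2 regularizer, and all hyperparameters are assumptions for illustration and not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


def retrieve_image(audio_emb, image_embs):
    """Cross-modal retrieval step (illustrative): pick the image whose
    embedding is most similar to the audio embedding."""
    sims = F.cosine_similarity(audio_emb, image_embs, dim=-1)
    return int(sims.argmax())


def manipulate_latent(w_init, audio_emb, generator, image_encoder,
                      steps=200, lr=0.05, lambda_reg=0.1):
    """Direct latent optimization (illustrative): nudge the latent code w so
    the generated image's embedding aligns with the audio embedding, while an
    L2 term keeps w near its starting point to preserve image content."""
    w = w_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = generator(w)                         # synthesize from the latent
        img_emb = image_encoder(img)               # embed into the shared space
        align = 1.0 - F.cosine_similarity(img_emb, audio_emb, dim=-1).mean()
        reg = (w - w_init).pow(2).mean()           # stay close to the source latent
        loss = align + lambda_reg * reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end; real pre-trained models
    # (StyleGAN generator, audio/image encoders) would replace these.
    torch.manual_seed(0)
    latent_dim, emb_dim = 512, 1024
    G = torch.randn(latent_dim, 3 * 64 * 64)       # fake "generator" projection
    E = torch.randn(3 * 64 * 64, emb_dim)          # fake "image encoder" projection
    generator = lambda w: torch.tanh(w @ G).view(-1, 3, 64, 64)
    image_encoder = lambda img: img.view(img.size(0), -1) @ E

    audio_emb = F.normalize(torch.randn(1, emb_dim), dim=-1)
    image_embs = F.normalize(torch.randn(8, emb_dim), dim=-1)
    print("retrieved image index:", retrieve_image(audio_emb, image_embs))

    w0 = torch.randn(1, latent_dim)
    w_edit = manipulate_latent(w0, audio_emb, generator, image_encoder, steps=50)
    print("latent shift L2:", (w_edit - w0).norm().item())
```

The regularized objective reflects a common design choice for latent-space editing: the alignment term pulls the output toward the audio semantics, while the penalty on deviation from the initial latent keeps the edit localized rather than regenerating the image from scratch.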