Audio-guided implicit neural representation for local image stylization
Published: 2024-08-14
ISSN: 2096-0433
Container-title: Computational Visual Media
Short-container-title: Comp. Visual Media
Language: en
Authors: Lee Seung Hyun, Kim Sieun, Byeon Wonmin, Oh Gyeongrok, In Sumin, Park Hyeongcheol, Yoon Sang Ho, Hong Sung-Hee, Kim Jinkyu, Kim Sangpil
Abstract
We present a novel framework for audio-guided localized image stylization. Sound often conveys information about the specific context of a scene and is closely tied to a particular part of that scene or to an object within it. However, existing image stylization works have focused on stylizing the entire image using an image or text input. Stylizing a particular part of the image based on audio input is natural but challenging. This work proposes a framework in which a user provides one audio input to localize the target in the input image and another audio input to locally stylize the target object or scene. We first produce a fine localization map using an audio-visual localization network that leverages the CLIP embedding space. We then utilize an implicit neural representation (INR), together with the predicted localization map, to stylize the target based on the sound information. The INR manipulates local pixel values so that they are semantically consistent with the provided audio input. Our experiments show that the proposed framework outperforms other audio-guided stylization methods. Moreover, we observe that our method constructs concise localization maps and naturally manipulates the target object or scene in accordance with the given audio input.
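For concreteness, the following is a minimal sketch, in PyTorch, of the second stage described above: an implicit neural representation over pixel coordinates whose output is blended with a soft localization map so that only the localized region is restyled. The image, the mask, the target "style", and all hyperparameters below are hypothetical stand-ins, not the authors' implementation; in the actual method the mask comes from the audio-visual localization network and the objective is a CLIP-space similarity between the stylized region and the audio embedding.

# Minimal sketch of INR-based local stylization.
# Assumptions: the localization map and the style objective are placeholders;
# the paper's localization network and CLIP-based audio guidance are not shown.
import torch
import torch.nn as nn

class CoordinateMLP(nn.Module):
    """Implicit neural representation: maps (x, y) coordinates to RGB values."""
    def __init__(self, hidden=256, layers=4):
        super().__init__()
        dims = [2] + [hidden] * layers + [3]
        blocks = []
        for i in range(len(dims) - 1):
            blocks.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                blocks.append(nn.ReLU())
        self.net = nn.Sequential(*blocks)

    def forward(self, coords):                    # coords: (N, 2) in [-1, 1]
        return torch.sigmoid(self.net(coords))    # RGB in [0, 1]

def pixel_grid(h, w):
    """Normalized (x, y) coordinates for every pixel, shape (H*W, 2)."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    return torch.stack([xs, ys], dim=-1).reshape(-1, 2)

# Placeholder inputs (in the paper these come from the source image and the
# audio-visual localization network, respectively).
H, W = 64, 64
image = torch.rand(H * W, 3)       # flattened source image
mask = torch.rand(H * W, 1)        # soft localization map in [0, 1]

inr = CoordinateMLP()
optimizer = torch.optim.Adam(inr.parameters(), lr=1e-3)
coords = pixel_grid(H, W)

for step in range(200):
    stylized = inr(coords)
    # Blend: only the localized region is driven by the INR output.
    output = mask * stylized + (1 - mask) * image

    # Stand-in objective: the actual method would maximize a CLIP-embedding
    # similarity between the stylized region and the audio input's embedding.
    target_color = torch.tensor([0.8, 0.3, 0.1])   # hypothetical "style" target
    style_loss = ((output - target_color) ** 2 * mask).mean()
    content_loss = ((output - image) ** 2 * (1 - mask)).mean()
    loss = style_loss + content_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()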
Publisher: Springer Science and Business Media LLC