A Study on Webtoon Generation Using CLIP and Diffusion Models

Authors:

Yu Kyungho 1, Kim Hyoungju 2, Kim Jeongin 3, Chun Chanjun 1, Kim Pankoo 1

Affiliations:

1. Department of Computer Engineering, Chosun University, 309 Pilmun-Daero, Dong-Gu, Gwangju 61452, Republic of Korea

2. Institute of AI Convergence, Chosun University, 309 Pilmun-Daero, Dong-Gu, Gwangju 61452, Republic of Korea

3. Department of Microbiology and Immunology, Chosun University School of Dentistry, 309 Pilmun-Daero, Dong-Gu, Gwangju 61452, Republic of Korea

Abstract

This study focuses on harnessing deep-learning-based text-to-image transformation techniques to support webtoon creators’ creative work. We converted publicly available datasets (e.g., MSCOCO) into a multimodal webtoon dataset using CartoonGAN. First, the dataset was used to train a contrastive language-image pre-training (CLIP) model, composed of a multilingual BERT and a Vision Transformer, that learned to associate text with images. Second, a pre-trained diffusion model was employed to generate webtoons from text input together with the most text-similar image. The webtoon dataset comprised treatments (i.e., textual descriptions) paired with their corresponding webtoon illustrations. CLIP, operating through contrastive learning, extracted features from the different data modalities, drawing similar data closer together within the same feature space while pushing dissimilar data apart; the model thereby learned the relationships between the modalities in the multimodal data. To generate webtoons with the diffusion model, the CLIP features of the desired webtoon’s text, together with those of the most text-similar image, were provided to the pre-trained diffusion model. Experiments were conducted using both single-text and continuous-text inputs, with the continuous-text inputs achieving an inception score of 7.14. The text-to-image technology developed here could streamline the webtoon creation process for artists by enabling the efficient generation of webtoons from the provided text. However, the current approach cannot generate webtoons from multiple sentences or images while maintaining a consistent artistic style. Further research is therefore needed to develop a text-to-image model capable of handling multi-sentence and multilingual input while ensuring a coherent artistic style across the generated webtoon images.
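The contrastive objective described in the abstract can be sketched as a symmetric InfoNCE loss over a batch of paired text/image features. The following is a minimal PyTorch sketch, not the paper’s exact implementation; the temperature value and the assumption that matched pairs share a batch index are illustrative assumptions:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_feats: torch.Tensor,
                          image_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Unit-normalize so the dot products below are cosine similarities.
    text_feats = F.normalize(text_feats, dim=-1)
    image_feats = F.normalize(image_feats, dim=-1)
    logits = text_feats @ image_feats.T / temperature  # (B, B) similarity matrix
    # Matched pairs sit on the diagonal: pull them together, push all
    # other pairings in the batch apart.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)    # text -> image direction
    loss_i2t = F.cross_entropy(logits.T, targets)  # image -> text direction
    return (loss_t2i + loss_i2t) / 2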
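The generation step pairs the text’s CLIP features with those of the most text-similar image before conditioning the diffusion model. Below is a minimal retrieval sketch that stands in for the paper’s multilingual-BERT/ViT CLIP with the Hugging Face openai/clip-vit-base-patch32 checkpoint; the panel filenames and the generate_webtoon conditioning interface are hypothetical placeholders, not real library calls:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(treatment: str) -> torch.Tensor:
    # Encode a treatment (textual description) into a unit-norm CLIP feature.
    inputs = processor(text=[treatment], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_images(paths):
    # Encode candidate webtoon illustrations into unit-norm CLIP features.
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Retrieve the candidate image whose features are closest to the text features.
text_feat = embed_text("A boy walks his dog through a rainy alley.")
image_feats = embed_images(["panel_001.png", "panel_002.png"])  # hypothetical files
best = (image_feats @ text_feat.T).squeeze(-1).argmax().item()

# Both feature vectors would then condition the pre-trained diffusion model;
# generate_webtoon is a placeholder for that model's conditioning interface.
condition = torch.cat([text_feat, image_feats[best:best + 1]], dim=-1)
# webtoon = generate_webtoon(condition)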

Funder

Chosun University

‘Technology Commercialization Collaboration Platform Construction’ project of the INNOPOLIS FOUNDATION

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering

