A Study on Webtoon Generation Using CLIP and Diffusion Models

Authors:

Yu Kyungho 1, Kim Hyoungju 2, Kim Jeongin 3, Chun Chanjun 1, Kim Pankoo 1

Affiliations:

1. Department of Computer Engineering, Chosun University, 309 Pilmun-Daero, Dong-Gu, Gwangju 61452, Republic of Korea

2. Institute of AI Convergence, Chosun University, 309 Pilmun-Daero, Dong-Gu, Gwangju 61452, Republic of Korea

3. Department of Microbiology and Immunology, Chosun University School of Dentistry, 309 Pilmun-Daero, Dong-Gu, Gwangju 61452, Republic of Korea

Abstract

This study focuses on harnessing deep-learning-based text-to-image transformation techniques to support webtoon creators’ creative work. We converted publicly available datasets (e.g., MSCOCO) into a multimodal webtoon dataset using CartoonGAN. First, the dataset was used to train a contrastive language-image pre-training (CLIP) model, composed of a multilingual BERT and a Vision Transformer, that learned to associate text with images. Second, a pre-trained diffusion model was employed to generate webtoons from text input together with the most text-similar image. The webtoon dataset comprised treatments (i.e., textual descriptions) paired with their corresponding webtoon illustrations. CLIP, operating through contrastive learning, extracted features from the different data modalities, drawing similar data closer together within the same feature space while pushing dissimilar data apart; the model thereby learned the relationships between the modalities in the multimodal data. To generate webtoons with the diffusion model, the CLIP features of the desired webtoon’s text, together with those of the most text-similar image, were provided to the pre-trained diffusion model. Experiments were conducted using both single-text and continuous-text inputs, with the continuous-text inputs achieving an inception score of 7.14. The text-to-image technology developed here could streamline the webtoon creation process for artists by enabling the efficient generation of webtoons from the provided text. However, the current approach cannot generate webtoons from multiple sentences or images while maintaining a consistent artistic style. Further research is therefore needed to develop a text-to-image model capable of handling multi-sentence and multilingual input while ensuring a coherent artistic style across the generated webtoon images.
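The contrastive objective described in the abstract can be sketched as a symmetric InfoNCE loss over a batch of paired text/image features. The following is a minimal PyTorch sketch, not the paper’s exact implementation; the temperature value and the assumption that matched pairs share a batch index are illustrative assumptions:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_feats: torch.Tensor,
                          image_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Unit-normalize so the dot products below are cosine similarities.
    text_feats = F.normalize(text_feats, dim=-1)
    image_feats = F.normalize(image_feats, dim=-1)
    logits = text_feats @ image_feats.T / temperature  # (B, B) similarity matrix
    # Matched pairs sit on the diagonal: pull them together, push all
    # other pairings in the batch apart.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)    # text -> image direction
    loss_i2t = F.cross_entropy(logits.T, targets)  # image -> text direction
    return (loss_t2i + loss_i2t) / 2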
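The generation step pairs the text’s CLIP features with those of the most text-similar image before conditioning the diffusion model. Below is a minimal retrieval sketch that stands in for the paper’s multilingual-BERT/ViT CLIP with the Hugging Face openai/clip-vit-base-patch32 checkpoint; the panel filenames and the generate_webtoon conditioning interface are hypothetical placeholders, not real library calls:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(treatment: str) -> torch.Tensor:
    # Encode a treatment (textual description) into a unit-norm CLIP feature.
    inputs = processor(text=[treatment], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_images(paths):
    # Encode candidate webtoon illustrations into unit-norm CLIP features.
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Retrieve the candidate image whose features are closest to the text features.
text_feat = embed_text("A boy walks his dog through a rainy alley.")
image_feats = embed_images(["panel_001.png", "panel_002.png"])  # hypothetical files
best = (image_feats @ text_feat.T).squeeze(-1).argmax().item()

# Both feature vectors would then condition the pre-trained diffusion model;
# generate_webtoon is a placeholder for that model's conditioning interface.
condition = torch.cat([text_feat, image_feats[best:best + 1]], dim=-1)
# webtoon = generate_webtoon(condition)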

Funder

Chosun University

‘Technology Commercialization Collaboration Platform Construction’ project of the INNOPOLIS FOUNDATION

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering

