Affiliation:
1. Nanyang Technological University, Singapore
2. Nanyang Technological University, Singapore and SenseTime Research, China
3. SenseTime Research, China
Abstract
3D avatar creation plays a crucial role in the digital age. However, the whole production process is prohibitively time-consuming and labor-intensive. To democratize this technology to a larger audience, we propose AvatarCLIP, a zero-shot text-driven framework for 3D avatar generation and animation. Unlike professional software that requires expert knowledge, AvatarCLIP empowers layman users to customize a 3D avatar with the desired shape and texture, and drive the avatar with the described motions using solely natural languages. Our key insight is to take advantage of the powerful vision-language model CLIP for supervising neural human generation, in terms of 3D geometry, texture and animation. Specifically, driven by natural language descriptions, we initialize 3D human geometry generation with a shape VAE network. Based on the generated 3D human shapes, a volume rendering model is utilized to further facilitate geometry sculpting and texture generation. Moreover, by leveraging the priors learned in the motion VAE, a CLIP-guided reference-based motion synthesis method is proposed for the animation of the generated 3D avatar. Extensive qualitative and quantitative experiments validate the effectiveness and generalizability of AvatarCLIP on a wide range of avatars. Remarkably, AvatarCLIP can generate unseen 3D avatars with novel animations, achieving superior zero-shot capability. Codes are available at https://github.com/hongfz16/AvatarCLIP.
Funder
RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative
NTU NAP, MOE AcRF Tier 2
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design
Reference87 articles.
1. Gunjan Aggarwal and Devi Parikh . 2021. Dance2Music: Automatic Dance-driven Music Generation. arXiv preprint arXiv:2107.06252 ( 2021 ). Gunjan Aggarwal and Devi Parikh. 2021. Dance2Music: Automatic Dance-driven Music Generation. arXiv preprint arXiv:2107.06252 (2021).
2. Text2Action: Generative Adversarial Synthesis from Language to Action
3. Language2Pose: Natural Language Grounded Pose Forecasting
4. Video Based Reconstruction of 3D People Models
5. DReCon
Cited by
84 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献