Author:
Zhang Zhimeng,Hu Zhipeng,Deng Wenjin,Fan Changjie,Lv Tangjie,Ding Yu
Abstract
For few-shot learning, it is still a critical challenge to realize photo-realistic face visually dubbing on high-resolution videos. Previous works fail to generate high-fidelity dubbing results. To address the above problem, this paper proposes a Deformation Inpainting Network (DINet) for high-resolution face visually dubbing. Different from previous works relying on multiple up-sample layers to directly generate pixels from latent embeddings, DINet performs spatial deformation on feature maps of reference images to better preserve high-frequency textural details. Specifically, DINet consists of one deformation part and one inpainting part. In the first part, five reference facial images adaptively perform spatial deformation to create deformed feature maps encoding mouth shapes at each frame, in order to align with input driving audio and also the head poses of input source images. In the second part, to produce face visually dubbing, a feature decoder is responsible for adaptively incorporating mouth movements from the deformed feature maps and other attributes (i.e., head pose and upper facial expression) from the source feature maps together. Finally, DINet achieves face visually dubbing with rich textural details. We conduct qualitative and quantitative comparisons to validate our DINet on high-resolution videos. The experimental results show that our method outperforms state-of-the-art works.
Publisher
Association for the Advancement of Artificial Intelligence (AAAI)
Cited by
7 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献
1. Embedded Representation Learning Network for Animating Styled Video Portrait;2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG);2024-05-27
2. DiffDub: Person-Generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-Encoder;ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);2024-04-14
3. RADIO: Reference-Agnostic Dubbing Video Synthesis;2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV);2024-01-03
4. HDTR-Net: A Real-Time High-Definition Teeth Restoration Network for Arbitrary Talking Face Generation Methods;Pattern Recognition and Computer Vision;2023-12-28
5. SVMFI: speaker video multi-frame interpolation with the guidance of audio;Multimedia Tools and Applications;2023-12-12