Affiliation:
1. Institute of Information Science, Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing Jiaotong University, China
Abstract
Talking head generation, which drives a source image to produce a talking video using information from other modalities, has made great progress in recent years. However, two main issues remain: 1) existing methods are designed to use only a single modality of driving information, and 2) most methods cannot control head pose. To address these problems, we propose a novel framework that uses multi-modal information to generate a talking head video while achieving arbitrary head pose control through a movement sequence. Specifically, first, to extend the driving information to multiple modalities, the multi-modal inputs are encoded into a unified semantic latent space to generate expression parameters. Second, to disentangle attributes, a 3D Morphable Model (3DMM) is used to obtain identity information from the source image and translation and rotation information from the target image. Third, to control head pose and mouth shape, the source image is warped by a motion field generated from the expression, translation, and angle parameters. Finally, all of the above parameters are used to render a landmark map, and the warped source image is combined with the landmark map to generate a refined talking head video. Experimental results demonstrate that the proposed method achieves state-of-the-art performance in visual quality, lip-audio synchronization, and head pose control.
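To make the pipeline outlined in the abstract more concrete, the sketch below shows one plausible way the stages could be wired together in PyTorch: a modality encoder that maps a driving signal into a shared latent space and predicts expression parameters, a motion-field generator conditioned on expression, translation, and rotation parameters, and a grid-sample warp of the source image. All module names (ModalityEncoder, MotionFieldGenerator), tensor shapes, and parameter dimensions are illustrative assumptions, not the authors' implementation; the landmark-map renderer is only indicated in a comment.

import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Maps a driving signal (e.g. audio features) into a unified semantic
    latent space and predicts 3DMM expression parameters (64-dim assumed)."""
    def __init__(self, in_dim, latent_dim=256, exp_dim=64):
        super().__init__()
        self.to_latent = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        self.to_exp = nn.Linear(latent_dim, exp_dim)

    def forward(self, x):
        z = self.to_latent(x)          # unified semantic latent code
        return self.to_exp(z)          # expression parameters

class MotionFieldGenerator(nn.Module):
    """Predicts a dense 2D motion field from expression, translation,
    and rotation (angle) parameters; the source image is warped with it."""
    def __init__(self, param_dim, h=256, w=256):
        super().__init__()
        self.h, self.w = h, w
        self.net = nn.Sequential(nn.Linear(param_dim, 512), nn.ReLU(),
                                 nn.Linear(512, h * w * 2))

    def forward(self, params):
        # Offsets added to an identity sampling grid, in normalized [-1, 1] units.
        return self.net(params).view(-1, self.h, self.w, 2)

def warp(source, flow):
    """Warp the source image with grid_sample using the predicted motion field."""
    b, _, h, w = source.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    return torch.nn.functional.grid_sample(source, grid + flow, align_corners=True)

# Toy forward pass on random data (shapes are illustrative only).
audio_feat = torch.randn(1, 80)                       # e.g. one mel-spectrogram slice
source_img = torch.randn(1, 3, 256, 256)              # source identity image
exp = ModalityEncoder(in_dim=80)(audio_feat)          # expression parameters
pose = torch.randn(1, 6)                              # translation + rotation parameters
flow = MotionFieldGenerator(param_dim=64 + 6)(torch.cat([exp, pose], dim=-1))
warped = warp(source_img, flow)                       # pose/mouth-controlled coarse frame
# A renderer (not sketched here) would combine `warped` with a landmark map drawn
# from the identity, expression, and pose parameters to produce the final frame.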
Publisher
Association for Computing Machinery (ACM)