Audio-Oriented Video Interpolation Using Key Pose

Author:

Takayuki Nakatsuka1, Yukitaka Tsuchiya1, Masatoshi Hamanaka2, Shigeo Morishima3

Affiliation:

1. Department of Pure and Applied Physics, Waseda University, 55N Building, Room 406, 3-4-1 Okubo, Shinjuku, Tokyo 169-8555, Japan

2. RIKEN, Nihonbashi 1-Chome Mitsui Building, 15th Floor, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan

3. Waseda Research Institute for Science and Engineering, 55N Building, Room 406, 3-4-1 Okubo, Shinjuku, Tokyo 169-8555, Japan

Abstract

This paper describes a deep-learning-based method for long-term video interpolation that generates intermediate frames between two music performance videos of a person playing a specific instrument. Recent advances in deep learning have made it possible to generate realistic, high-fidelity, high-resolution images for short-term video interpolation. However, there is still room for improvement in long-term video interpolation because the generated videos lack resolution and temporal consistency. In music performance videos in particular, the music and the performer's motion must be synchronized. We address these problems by exploiting human poses and music features, both essential to musical performance, in long-term video interpolation. By closely matching human poses to the music and the videos, our method can generate intermediate frames that are synchronized with the music. Specifically, we extract the human poses of the last frame of the first video and the first frame of the second video as key poses. Our encoder–decoder network then estimates the human poses of the intermediate frames from these key poses, conditioned on the music features. To make the network end-to-end trainable, we use a differentiable module that converts the estimated human poses from vector form into image form, such as human stick figures. Finally, a video-to-video synthesis network takes the stick figures and generates the intermediate frames between the two performance videos. Quantitative experiments show that the generated performance videos are of higher quality than those of the baseline method.
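The pipeline the abstract outlines (key-pose extraction, music-conditioned pose interpolation, differentiable stick-figure rendering, video-to-video synthesis) can be summarized in code. Below is a minimal PyTorch sketch of the middle two stages; PoseInterpolator, render_stick_figure, and every dimension are hypothetical names and assumptions for illustration, not the authors' implementation, and the final video-to-video stage is omitted.

```python
# Minimal PyTorch sketch of the abstract's pipeline. All module names,
# dimensions, and architectural choices here are assumptions for
# illustration; they are NOT the authors' implementation.
import torch
import torch.nn as nn

class PoseInterpolator(nn.Module):
    """Encoder-decoder that estimates intermediate human poses between
    two key poses, conditioned on per-frame music features."""
    def __init__(self, n_joints=17, music_dim=128, hidden=256):
        super().__init__()
        pose_dim = n_joints * 2                      # (x, y) per joint
        # Encoder: fuse both key poses with a global music summary.
        self.encoder = nn.Sequential(
            nn.Linear(2 * pose_dim + music_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        # Decoder: a GRU unrolled over the intermediate frames, fed the
        # per-frame music features as its conditioning input.
        self.decoder = nn.GRU(music_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, pose_dim)

    def forward(self, key_pose_a, key_pose_b, music):
        # key_pose_a/b: (B, pose_dim); music: (B, T, music_dim)
        h = self.encoder(
            torch.cat([key_pose_a, key_pose_b, music.mean(dim=1)], dim=-1))
        out, _ = self.decoder(music, h.unsqueeze(0))  # h: initial state
        return self.head(out)                         # (B, T, pose_dim)

def render_stick_figure(poses, size=64, sigma=2.0):
    """Differentiable rasterization: each joint becomes a Gaussian blob,
    so image-space losses backpropagate to the pose vectors. Drawing
    the limb segments is omitted for brevity."""
    B, T, D = poses.shape
    joints = poses.view(B, T, D // 2, 2)              # coords in [0, 1]
    axis = torch.linspace(0, 1, size, device=poses.device)
    gy, gx = torch.meshgrid(axis, axis, indexing="ij")
    grid = torch.stack([gx, gy], dim=-1)              # (size, size, 2)
    d2 = ((grid[None, None, None] - joints[..., None, None, :]) ** 2).sum(-1)
    heat = torch.exp(-d2 / (2 * (sigma / size) ** 2))
    return heat.max(dim=2).values                     # (B, T, size, size)

# Toy usage: two key poses, 60 intermediate frames of music features.
model = PoseInterpolator()
a, b = torch.rand(4, 34), torch.rand(4, 34)
music = torch.rand(4, 60, 128)
poses = model(a, b, music)                            # (4, 60, 34)
frames = render_stick_figure(poses)                   # (4, 60, 64, 64)
```

Because the rasterizer is differentiable, gradients from losses on the rendered stick figures can flow back into the pose estimator, which is what makes the end-to-end training described in the abstract possible; a video-to-video synthesis network would then translate each stick-figure frame into a photorealistic performance frame.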

Funder

Japan Society for the Promotion of Science

Accelerated Innovation Research Initiative Turning Top Science and Ideas into High-Impact Values

Japan Science and Technology Agency

Publisher

World Scientific Pub Co Pte Ltd

Subject

Artificial Intelligence, Computer Vision and Pattern Recognition, Software

Cited by 3 articles.

1. Implementation of Melody Slot Machines;Lecture Notes in Computer Science;2024

2. Melody Slot Machine II: Sound Enhancement with Multimodal Interface;International Conference on Multimodal Interaction;2023-10-09

3. An Automatic Music-Driven Folk Dance Movements Generation Method Based on Sequence-To-Sequence Network;International Journal of Pattern Recognition and Artificial Intelligence;2023-03-20
