Authors:
Shakirzyanov Marsel, Gibadullin Ruslan, Nuriyev Marat
Abstract
Deep learning and reinforcement learning open up new possibilities for automatically matching video and audio data. This article explores the key steps in developing such a system, from matching phonemes to lip movements to selecting appropriate machine-learning models. It also discusses the importance of designing the reward function correctly, the balance between exploration and exploitation, and the complexities of collecting training data. The article emphasizes the value of pre-trained models and transfer learning, as well as the need to correctly evaluate and interpret results in order to improve the system and produce high-quality content. Particular attention is paid to developing effective matching-quality metrics and visualization methods to fully analyze system performance and identify possible areas for improvement.
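The exploration/exploitation balance mentioned in the abstract is commonly handled with an epsilon-greedy policy. The sketch below is illustrative only and is not taken from the article: the function name `epsilon_greedy` and the example Q-values for candidate lip-sync alignments are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        # Explore: try a random action to gather new experience
        return random.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-values for three candidate audio-video alignments
q = [0.2, 0.9, 0.5]
action = epsilon_greedy(q, epsilon=0.0)  # epsilon=0 -> always greedy, picks index 1
```

Raising `epsilon` makes the agent sample alignments it is less certain about, at the cost of short-term reward; annealing `epsilon` toward zero during training is a common compromise.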
Cited by 18 articles.