Video Scene Detection Using Transformer Encoding Linker Network (TELNet)
Authors:
Tseng Shu-Ming 1, Yeh Zhi-Ting 2, Wu Chia-Yang 1, Chang Jia-Bin 2, Norouzi Mehdi 2
Affiliations:
1. Department of Electronic Engineering, National Taipei University of Technology, Taipei 106335, Taiwan
2. College of Engineering and Applied Science, University of Cincinnati, Cincinnati, OH 45219, USA
Abstract
This paper introduces a transformer encoding linker network (TELNet) that automatically identifies scene boundaries in videos without prior knowledge of their structure. Videos consist of sequences of semantically related shots or chapters, and recognizing scene boundaries is crucial for many video processing tasks, including video summarization. TELNet scans through video shots with a rolling window, encoding shot features extracted from a fine-tuned 3D CNN model with a transformer encoder. A linker module then establishes links between shots based on these encoded features, and TELNet identifies a scene boundary wherever consecutive shots are not linked. TELNet was trained on multiple video scene detection datasets and achieved results comparable to other state-of-the-art models in standard settings. Notably, in cross-dataset evaluations, TELNet achieved a significantly higher F-score. Furthermore, TELNet’s computational complexity grows linearly with the number of shots, making it efficient for processing long videos.
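To make the mechanism described in the abstract concrete, the following is a minimal, hypothetical sketch of the rolling-window transformer-encoder-plus-linker idea: shot features (assumed here to come from a fine-tuned 3D CNN) are encoded by a transformer encoder within each window, a linker head scores pairwise links between shots, and a scene boundary is declared between consecutive shots that no link crosses. All class names, dimensions, the window size, and the boundary rule below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TELNetSketch(nn.Module):
    """Hypothetical TELNet-style module: transformer encoder + pairwise linker."""

    def __init__(self, feat_dim=512, num_heads=8, num_layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Linker head: scores how strongly shot i links to shot j inside a window.
        self.query_proj = nn.Linear(feat_dim, feat_dim)
        self.key_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, shot_feats):
        # shot_feats: (window_size, feat_dim) features of the shots in one window
        encoded = self.encoder(shot_feats.unsqueeze(0)).squeeze(0)
        q = self.query_proj(encoded)
        k = self.key_proj(encoded)
        # Pairwise link scores between all shots in the window.
        return q @ k.t() / encoded.size(-1) ** 0.5


def scene_boundaries(shot_feats, window=8):
    """Slide a rolling window over shot features and mark a boundary between
    shots t and t+1 when no predicted link spans that gap (illustrative rule)."""
    model = TELNetSketch(feat_dim=shot_feats.size(-1))
    model.eval()
    n = shot_feats.size(0)
    crossed = [False] * (n - 1)  # crossed[t] is True if some link spans shots t and t+1
    with torch.no_grad():
        for start in range(0, max(n - window + 1, 1)):
            scores = model(shot_feats[start:start + window])
            links = scores.argmax(dim=-1)  # each shot links to its best match
            for i, j in enumerate(links.tolist()):
                lo, hi = sorted((start + i, start + j))
                for gap in range(lo, hi):  # this link spans gaps lo .. hi-1
                    crossed[gap] = True
    return [t for t, c in enumerate(crossed) if not c]  # boundary after shot t


if __name__ == "__main__":
    feats = torch.randn(20, 512)  # 20 shots with assumed 512-d 3D CNN features
    print(scene_boundaries(feats))
```

Because each window is processed independently and the window size is fixed, the number of encoder passes grows linearly with the number of shots, which mirrors the linear complexity claim in the abstract.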
Funder
National Science and Technology Council, Taiwan; University of Cincinnati, Cincinnati, OH
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry
Cited by
5 articles.