Characters Link Shots: Character Attention Network for Movie Scene Segmentation-Reference-Cited by-同舟云学术

Characters Link Shots: Character Attention Network for Movie Scene Segmentation

Published:2023-12-11 Issue:4 Volume:20 Page:1-23
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Tan Jiawei¹^ORCID,Wang Hongxing¹^ORCID,Yuan Junsong²^ORCID

Affiliation:

1. Key Laboratory of Dependable Service Computing in Cyber Physical Society (Chongqing University), Ministry of Education, China, and School of Big Data and Software Engineering, Chongqing University, China

2. Department of Computer Science and Engineering, State University of New York at Buffalo, USA

Abstract

Movie scene segmentation aims to automatically segment a movie into multiple story units, i.e., scenes, each of which is a series of semantically coherent and time-continual shots. Previous methods have continued efforts on shot semantic association, but few take into account the impact of different semantics on foreground characters and background scenes in movie shots. In particular, the background scene in the shot can adversely affect scene boundary classification. Motivated by the fact that it is the characters who drive the plot development of a movie scene, we build a Character Attention Network (CANet) to detect movie scene boundaries in a character-centric fashion. To eliminate the background clutter, we extract multi-view character semantics for each shot in terms of human bodies and faces. Furthermore, we equip our CANet with two stages of character attention. The first is Masked Shot Attention (MSA) through selective self-attention over similar temporal contexts from multi-view character semantics to yield an enhanced omni-view shot representation, by which the CANet can better handle the variations of characters in pose and appearance. The second is Key Character Attention (KCA) through temporal-aware attention on character reappearances for Bidirectional Long Short-Term Memory (Bi-LSTM) feature association so that linking shots can be focused on those with recurring key characters. We encourage the proposed CANet in learning boundary-discriminative shot features. Specifically, we formulate a Boundary-Aware circle Loss (BAL) to push far apart CANet-features between adjacent scenes, which is also coupled with the cross-entropy loss to drive CANet-features sensitive to scene boundaries. Experimental results on the MovieNet-SSeg and OVSD datasets show that our method achieves superior performance in temporal scene segmentation compared with state-of-the-art methods.

Funder

National Natural Science Foundation of China

Key Project of Chongqing Technology Innovation and Application Development

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications,Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3630257

Reference64 articles.

1. Max Bain, Arsha Nagrani, Andrew Brown, and Andrew Zisserman. 2020. Condensed movies: Story based retrieval with contextual embeddings. In ACCV. 460–479.

2. Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. 2015. A deep Siamese network for scene detection in broadcast videos. In MM. 1199–1202.

3. Longformer: The long-document transformer;Beltagy Iz;CoRR,2020

4. Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding?. In ICML (Proceedings of Machine Learning Research), Vol. 139. 813–824.

5. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV. 213–229.

Cited by 1 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Semantic Transition Detection for Self-supervised Video Scene Segmentation;MultiMedia Modeling;2024