Self-contained Entity Discovery from Captioned Videos-Reference-Cited by-同舟云学术

Self-contained Entity Discovery from Captioned Videos

Published:2023-06-07 Issue:5s Volume:19 Page:1-21
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Ayoughi Melika¹^ORCID,Mettes Pascal¹^ORCID,Groth Paul¹^ORCID

Affiliation:

1. University of Amsterdam, The Netherlands

Abstract

This article introduces the task of visual named entity discovery in videos without the need for task-specific supervision or task-specific external knowledge sources. Assigning specific names to entities (e.g., faces, scenes, or objects) in video frames is a long-standing challenge. Commonly, this problem is addressed as a supervised learning objective by manually annotating entities with labels. To bypass the annotation burden of this setup, several works have investigated the problem by utilizing external knowledge sources such as movie databases. While effective, such approaches do not work when task-specific knowledge sources are not provided and can only be applied to movies and TV series. In this work, we take the problem a step further and propose to discover entities in videos from videos and corresponding captions or subtitles. We introduce a three-stage method where we (i) create bipartite entity-name graphs from frame–caption pairs, (ii) find visual entity agreements, and (iii) refine the entity assignment through entity-level prototype construction. To tackle this new problem, we outline two new benchmarks, SC-Friends and SC-BBT , based on the Friends and Big Bang Theory TV series. Experiments on the benchmarks demonstrate the ability of our approach to discover which named entity belongs to which face or scene, with an accuracy close to a supervised oracle, just from the multimodal information present in videos. Additionally, our qualitative examples show the potential challenges of self-contained discovery of any visual entity for future work. The code and the data are available on GitHub. 1

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications,Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3583138

Reference71 articles.

1. Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. 2016. Unsupervised learning from narrated instruction videos. In CVPR.

2. Max Bain, Arsha Nagrani, Andrew Brown, and Andrew Zisserman. 2020. Condensed movies: Story based retrieval with contextual embeddings. In ACCV.

3. Yi Bin, Xindi Shang, Bo Peng, Yujuan Ding, and Tat-Seng Chua. 2021. Multi-perspective video captioning. In ACM Multimedia.

4. Hervé Bredin, Claude Barras, and Camille Guinaudeau. 2016. Multimodal person discovery in broadcast TV at MediaEval 2016. In MediaEval 2016.

5. Andrew Brown, Ernesto Coto, and Andrew Zisserman. 2021. Automated video labelling: Identifying faces by corroborative evidence. In IEEE-MIPR.