Affiliation:
1. Artificial Intelligence Institute, Seoul National University, Seoul, Republic of Korea
2. Seoul National University, Seoul, Republic of Korea
3. KT, Seoul, Republic of Korea
Abstract
Artificial intelligence (AI) agents that understand video content at a human level and converse with humans about it are a promising and widely anticipated application. Building them is challenging: it requires holistic integration of multimodal information with temporal dependencies, reasoning over that information, and social and physical commonsense. Appropriate systematic evaluation methods are also essential. In this context, we introduce the Video Turing Test (VTT), a blind test that evaluates human‐likeness in video comprehension. We also propose Vincent, a video understanding AI. We describe the configuration of VTT, the architecture of Vincent as prepared for VTT, and the proposed evaluation methods for video comprehension. Based on our results, we estimate the current intelligence level of AI and discuss future research directions.
Funder
National Research Foundation of Korea