Affiliation:
1. Department of Computer Science and Engineering, Guru Ghasidas Vishwavidyalaya, Bilaspur, Chhattisgarh, India
2. GLA University, Mathura, Uttar Pradesh, India
Abstract
Precise video moment retrieval is crucial for enabling users to locate specific moments within a large video corpus. This paper presents Interactive Moment Localization with Multimodal Fusion (IMF-MF), a novel model that leverages self-attention to achieve state-of-the-art performance. IMF-MF integrates query context with multimodal features, including visual and audio information, to accurately localize moments of interest. The model operates in two distinct phases: feature fusion and joint representation learning. The first phase dynamically computes fusion weights to adaptively combine multimodal video content, ensuring that the most relevant features are prioritized. The second phase employs bi-directional attention to tightly couple video and query features into a unified joint representation for moment localization. This joint representation captures long-range dependencies and complex patterns, enabling the model to distinguish relevant from irrelevant video segments. The effectiveness of IMF-MF is demonstrated through comprehensive evaluations on three benchmark datasets: TVR (closed-world TV episodes), Charades (open-world user-generated videos), and DiDeMo (an open-world, diverse video moment retrieval dataset). The empirical results show that the proposed approach surpasses existing state-of-the-art methods in retrieval accuracy, as measured by Recall (R1, R5, R10, and R100) and Intersection-over-Union (IoU), highlighting the benefits of its interactive moment localization approach and its use of self-attention for feature representation and attention modeling.
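The abstract does not give implementation details, so the following is only a minimal sketch of one plausible reading of the two phases it describes (gated fusion of visual and audio features, then bi-directional video-query attention), written in PyTorch. All module names, feature dimensions, and the softmax-gated fusion scheme are illustrative assumptions, not the authors' architecture.

# Minimal sketch (assumed design, not the authors' implementation) of the
# two-phase pipeline described in the abstract.
import torch
import torch.nn as nn


class GatedMultimodalFusion(nn.Module):
    """Phase 1 (assumed): dynamic weights to combine visual and audio features."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)  # one weight per modality

    def forward(self, visual, audio):
        # visual, audio: (batch, num_clips, dim)
        weights = torch.softmax(self.gate(torch.cat([visual, audio], dim=-1)), dim=-1)
        return weights[..., :1] * visual + weights[..., 1:] * audio


class BiDirectionalAttention(nn.Module):
    """Phase 2 (assumed): couple video and query features into a joint representation."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.video_to_query = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.query_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video, query):
        # video: (batch, num_clips, dim); query: (batch, num_words, dim)
        v_attended, _ = self.video_to_query(video, query, query)  # video attends to query
        q_attended, _ = self.query_to_video(query, video, video)  # query attends to video
        # Fold a query summary back into every clip representation.
        joint = v_attended + q_attended.mean(dim=1, keepdim=True)
        return joint  # (batch, num_clips, dim), scored per clip downstream


if __name__ == "__main__":
    fusion = GatedMultimodalFusion(dim=256)
    attention = BiDirectionalAttention(dim=256)
    visual = torch.randn(2, 128, 256)  # clip-level visual features
    audio = torch.randn(2, 128, 256)   # aligned audio features
    query = torch.randn(2, 20, 256)    # query-token embeddings
    joint = attention(fusion(visual, audio), query)
    print(joint.shape)  # torch.Size([2, 128, 256])

In this sketch, the joint clip-level representation would feed a downstream moment-boundary scorer; how IMF-MF actually scores candidate moments is not specified in the abstract.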