Semantic context driven language descriptions of videos using deep neural network

Published:2022-02-10 Issue:1 Volume:9
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Naik Dinesh, Jaidhar C. D.

Abstract

The massive addition of text, image, and video data to the internet has made computer vision tasks challenging in the big data domain. Despite recent exploration of video data and progress in visual information captioning, video captioning remains an arduous task in computer vision. Visual captioning requires integrating visual information with natural language descriptions. This paper proposes an encoder-decoder framework trained with a hybrid loss function: the encoder combines a 2D Convolutional Neural Network (CNN) with a layered Long Short-Term Memory (LSTM) network, and the decoder is an LSTM integrated with an attention mechanism. The 2D-CNN extracts visual feature vectors from the video frames to capture spatial features, and these vectors are fed into the layered LSTM to capture temporal information. The attention mechanism enables the decoder to focus on relevant objects and to correlate the visual context with the language content, producing semantically correct captions. The visual features and GloVe word embeddings are input to the decoder to generate natural semantic descriptions of the videos. The framework is evaluated on the Microsoft Video Description (MSVD) video captioning benchmark using several well-known evaluation metrics. The experimental findings indicate that the proposed framework outperforms state-of-the-art techniques: it significantly improves all measures, B@1, B@2, B@3, B@4, METEOR, and CIDEr, with scores of 78.4, 64.8, 54.2, 43.7, 32.3, and 70.7, respectively. The improvement across all scores indicates a better grasp of the context of the inputs, which results in more accurate caption prediction.
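The attention step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which scoring function the authors use, so the dot-product scoring below is an assumption, and the names `soft_attention`, `decoder_state`, and `frame_features` are hypothetical. The sketch only shows the general idea of soft attention: weight each frame's encoder output by its relevance to the current decoder state, then average the outputs into one context vector.

```python
import math


def soft_attention(decoder_state, frame_features):
    """Return (weights, context) for a single decoding step.

    decoder_state:  list of floats, the decoder's current hidden state.
    frame_features: list of per-frame encoder outputs, each the same
                    length as decoder_state.
    Scoring is a plain dot product (an assumption, not the paper's
    confirmed choice).
    """
    # Relevance score of each frame with respect to the decoder state.
    scores = [
        sum(h * f for h, f in zip(decoder_state, feat))
        for feat in frame_features
    ]
    # Numerically stable softmax over the frame scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the frame features.
    dim = len(frame_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, context


# Three toy "frames" with two-dimensional features.
weights, context = soft_attention(
    [1.0, 0.0],
    [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]],
)
```

In this toy call the first frame aligns best with the decoder state, so it receives the largest weight and dominates the context vector. In a full model the context vector would be passed to the decoder LSTM together with the word embedding at each step.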

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

https://link.springer.com/content/pdf/10.1186/s40537-022-00569-4.pdf

Reference: 59 articles.

1. Suryawati E, Pardede HF, Zilvan V, Ramdan A, Krisnandi D, Heryana A, Yuwana RS, Kusumo R, Arisal A, Supianto AA. Unsupervised feature learning-based encoder and adversarial networks. J Big Data. 2021;8(1):1–17. https://doi.org/10.1186/s40537-021-00508-9.

2. Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, Santamaría J, Fadhel MA, Al-Amidie M, Farhan L. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data. 2021;8(1):1–74. https://doi.org/10.1186/s40537-021-00444-8.

3. Sampath V, Maurtua I, Martín JJA, Gutierrez A. A survey on generative adversarial networks for imbalance problems in computer vision tasks. J Big Data. 2021;8(1):1–59. https://doi.org/10.1186/s40537-021-00414-0.

4. Bin Y, Yang Y, Shen F, Xie N, Shen HT, Li X. Describing video with attention-based bidirectional LSTM. IEEE Trans Cybern. 2019;49(7):2631–41. https://doi.org/10.1109/TCYB.2018.2831447.

5. Olivastri S, Singh G, Cuzzolin F. End-to-end video captioning. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 1474–1482, 2019. https://doi.org/10.1109/ICCVW.2019.00185.

Cited by 4 articles.

1. An improved deep hashing model for image retrieval with binary code similarities;Journal of Big Data;2024-04-18

2. Bilingual video captioning model for enhanced video retrieval;Journal of Big Data;2024-01-16

3. Combinatorial Analysis of Deep Learning and Machine Learning Video Captioning Studies: A Systematic Literature Review;IEEE Access;2024

4. Visualized Analysis of the Emerging Trends of Automated Audio Description Technology;Machine Learning for Cyber Security;2023