Authors:
Yousif Adel Jalal, Al-Jammas Mohammed H.
Abstract
Video captioning techniques have practical applications in fields such as video surveillance and robotic vision, particularly in real-time scenarios. However, most current approaches still exhibit certain limitations when applied to live video, and research has predominantly focused on English-language captioning. In this paper, we introduce a novel approach for live real-time Arabic video captioning using deep neural networks with a parallel architecture implementation. The proposed model relies primarily on an encoder-decoder architecture trained end-to-end on Arabic text. A Video Swin Transformer and a deep convolutional network are employed for video understanding, while the standard Transformer architecture is used for both video feature encoding and caption decoding. Results from experiments conducted on the translated MSVD and MSR-VTT datasets demonstrate that an end-to-end Arabic model yields better performance than methods that translate generated English captions into Arabic. Our approach demonstrates notable improvements over the compared methods, yielding CIDEr scores of 78.3 and 36.3 on the MSVD and MSR-VTT datasets, respectively. In terms of inference speed, our model achieved a latency of approximately 95 ms on an RTX 3090 GPU for a temporal video segment of 16 frames captured online from a camera device.
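The sketch below is not the authors' code; it is a minimal PyTorch illustration of the encoder-decoder pattern the abstract describes, where clip-level features (e.g., from a Video Swin Transformer backbone) are encoded by a standard Transformer encoder and an Arabic caption is generated autoregressively by a Transformer decoder. The feature dimension, vocabulary size, token ids, and decoding parameters are placeholder assumptions, not values from the paper.

```python
# Minimal sketch of a Transformer-based video captioner (not the paper's code).
# Assumes video features have already been extracted by a backbone such as a
# Video Swin Transformer; all dimensions and token ids below are placeholders.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=768, d_model=512, vocab_size=20000,
                 nhead=8, num_layers=4, max_len=30):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)        # map backbone features to model width
        self.embed = nn.Embedding(vocab_size, d_model)  # Arabic (sub)word embeddings
        self.pos = nn.Embedding(max_len, d_model)       # learned positional encoding
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)
        self.max_len = max_len

    def forward(self, feats, tokens):
        # feats: (B, T, feat_dim) clip features; tokens: (B, L) caption ids
        memory = self.encoder(self.proj(feats))
        pos = torch.arange(tokens.size(1), device=tokens.device)
        tgt = self.embed(tokens) + self.pos(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        return self.out(self.decoder(tgt, memory, tgt_mask=mask))

    @torch.no_grad()
    def greedy_decode(self, feats, bos_id=1, eos_id=2):
        # Autoregressively generate one caption per video segment.
        tokens = torch.full((feats.size(0), 1), bos_id, dtype=torch.long, device=feats.device)
        for _ in range(self.max_len - 1):
            logits = self.forward(feats, tokens)
            next_tok = logits[:, -1].argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_tok], dim=1)
            if (next_tok == eos_id).all():
                break
        return tokens

# Example: features for a 16-frame segment pooled into 8 temporal tokens.
model = VideoCaptioner()
feats = torch.randn(1, 8, 768)            # stand-in for backbone output
caption_ids = model.greedy_decode(feats)  # ids to be detokenized into Arabic text
```

In a real-time setting, the backbone and the caption generator can run in separate processes or CUDA streams so that feature extraction for the next 16-frame segment overlaps with decoding of the current one, which is one plausible reading of the parallel implementation mentioned in the abstract.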
Publisher
University of Diyala, College of Science
Cited by
2 articles.