A Lightweight Visual Understanding System for Enhanced Assistance to the Visually Impaired Using an Embedded Platform
Published: 2024-09-01
Pages: 146-162
ISSN: 2616-6909
Container-title: Diyala Journal of Engineering Sciences
Short-container-title: DJES
Authors: Yousif Adel Jalal, Al-Jammas Mohammed H.
Abstract
Visually impaired individuals often face significant challenges in navigating their environments due to limited access to visual information. To address this issue, a portable, cost-effective assistive tool is proposed that operates on a low-power embedded system such as the Jetson Nano. The novelty of this research lies in developing an efficient, lightweight video captioning model under constrained resources to ensure compatibility with embedded platforms. This research aims to enhance the autonomy and accessibility of visually impaired people by providing audio descriptions of their surroundings through the processing of live-streamed video. The proposed system utilizes two distinct lightweight deep learning modules: an object detection module based on the state-of-the-art YOLOv7 model, and a video captioning module that employs both the Video Swin Transformer and a 2D-CNN for feature extraction, together with a Transformer network for caption generation. The object detection module provides real-time identification of multiple objects in the user's surroundings, while the video captioning module provides detailed descriptions of entire visual scenes and activities, including objects, actions, and the relationships between them. The user interacts with the system through headphones, issuing a specific audio command to trigger the corresponding module, either object detection or video captioning, and receiving an audio description of the visual content in return. The system demonstrates satisfactory results, achieving inference speeds between 0.11 and 1.1 seconds for object detection and between 0.91 and 1.85 seconds for video captioning, evaluated through both quantitative metrics and subjective assessments.
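To make the voice-driven interaction flow concrete, below is a minimal Python sketch of the command-routed pipeline the abstract describes: a spoken command selects either the object detection module or the video captioning module, and the result is read back as audio. The `assistive_modules` package, the `Yolov7Detector` and `VideoCaptioner` classes, and their method names are hypothetical stand-ins for the paper's components, not the authors' code; `pyttsx3` and `speech_recognition` are common off-the-shelf audio I/O libraries and may differ from what the authors used on the Jetson Nano.

```python
import pyttsx3                   # offline text-to-speech engine
import speech_recognition as sr  # microphone capture + speech-to-text

# Hypothetical wrappers around the paper's two modules: a lightweight
# YOLOv7 detector and a Video Swin Transformer + Transformer captioner.
from assistive_modules import Yolov7Detector, VideoCaptioner

detector = Yolov7Detector(weights="yolov7-tiny.pt")    # assumed lightweight variant
captioner = VideoCaptioner(checkpoint="captioner.pt")  # assumed checkpoint path
tts = pyttsx3.init()
recognizer = sr.Recognizer()

def speak(text: str) -> None:
    """Read a description aloud through the user's headphones."""
    tts.say(text)
    tts.runAndWait()

with sr.Microphone() as mic:
    recognizer.adjust_for_ambient_noise(mic)
    while True:
        audio = recognizer.listen(mic)  # wait for a spoken command
        try:
            # The paper's actual recognizer may be an offline engine;
            # recognize_google is used here only for illustration.
            command = recognizer.recognize_google(audio).lower()
        except sr.UnknownValueError:
            continue  # ignore unintelligible input

        if "detect" in command:
            # Real-time identification of multiple objects in the scene.
            labels = detector.detect_from_camera()
            speak("I can see " + ", ".join(labels))
        elif "describe" in command:
            # Caption a short clip of the live video stream.
            caption = captioner.caption_from_camera(seconds=2)
            speak(caption)
        elif "stop" in command:
            break
```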
Publisher
University of Diyala, College of Science