A Hybrid Model Combining Depthwise Separable Convolutions and Vision Transformers for Traffic Sign Classification Under Challenging Weather Conditions.

Published: 2024-06-07

Author:

Parse Milind Vijay¹, Pramod Dhanya², Kumar Deepak³

Affiliation:

1. Symbiosis International University: Symbiosis International (Deemed University)

2. Symbiosis International (Deemed University)

3. Amity University Greater Noida

Abstract

This research presents a novel deep-learning framework for traffic sign image classification under adverse conditions, including rain, shadows, haze, codec errors, and dirty lenses. To balance accuracy against parameter count, the approach combines depthwise and pointwise convolutions, together known as depthwise separable convolutions, with a Vision Transformer (ViT) for subsequent feature extraction. The framework's initial block comprises two pairs of depthwise and pointwise convolutional layers followed by a normalization layer. Depthwise convolution processes each input channel independently with its own filter, reducing computational cost and parameter count while maintaining spatial structure. Pointwise convolutional layers combine information across channels, fostering complex feature interactions and non-linearities. Batch normalization is used for training stability. At the end of the block, a max pooling layer enhances salient features and downsamples the spatial dimensions. This block is repeated four times, with skip connections preserving crucial information. Inter-block skip connections help extract global context, and global average pooling (GAP) reduces dimensionality while retaining vital information. Integrating the ViT model in the final layers captures long-range dependencies and relations in the feature maps. The framework concludes with two fully connected layers: a bottleneck layer with 1024 neurons and a second layer that uses softmax activation to generate a probability distribution over 14 classes. Combining convolution blocks and skip connections with carefully tuned ViT hyperparameters, the proposed framework achieves a validation accuracy of 99.3%.
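
The parameter savings that the abstract attributes to depthwise separable convolutions can be illustrated with a quick weight count. The sketch below is only an illustration and is not taken from the paper: the kernel size (3) and the channel widths (64 in, 128 out) are assumed values, the function names are hypothetical, and bias terms are omitted.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes information across channels
    return depthwise + pointwise


# Assumed layer sizes for illustration only; the abstract does not state them.
c_in, c_out, k = 64, 128, 3

std = standard_conv_params(c_in, c_out, k)        # 3*3*64*128 = 73728
sep = depthwise_separable_params(c_in, c_out, k)  # 3*3*64 + 64*128 = 8768

print(f"standard conv weights:        {std}")
print(f"depthwise separable weights:  {sep}")
print(f"reduction factor:             {std / sep:.2f}x")
```

Under these assumed sizes the separable pair needs roughly an eighth of the weights of a standard convolution. The actual savings in the proposed framework depend on its real channel widths and kernel sizes, which the abstract does not give.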

Publisher

Research Square Platform LLC
