Self-Attention-Based Convolutional GRU for Enhancement of Adversarial Speech Examples

Published:2023-07-08
ISSN:0219-4678
Container-title:International Journal of Image and Graphics
language:en
Short-container-title:Int. J. Image Grap.

Author:

Jannu Chaitanya¹, Vanambathina Sunny Dayal¹

Affiliation:

1. School of Electronics Engineering, VIT-AP University, Amaravati, Vijayawada, AP, India

Abstract

Recent research has identified adversarial examples as a challenge to DNN-based automatic speech recognition (ASR) systems. In this paper, we propose a new model, called [Formula: see text], based on a convolutional GRU and a self-attention U-Net, to enhance adversarial speech signals. To represent the correlation between neighboring noisy speech frames, a two-layer GRU is added in the bottleneck of the U-Net, and an attention gate is inserted in the up-sampling units to increase adversarial stability. The GRU is used to combine weight sharing with gating, which controls the flow of data across multiple feature maps. As a result, it outperforms the original 1D convolution used in [Formula: see text]. In particular, the model is evaluated with explainable speech recognition metrics, and its performance is analyzed under improved adversarial training. We used adversarial audio attacks to perform experiments on ASR. We observed that (i) the robustness of DNN-based ASR models can be improved using the temporal features captured by the attention-based GRU network, and (ii) adversarial training, including some additive adversarial data augmentation, can improve the generalization power of DNN-based ASR models. The word-error-rate (WER) metric confirmed that the enhancement capabilities are better than those of the state-of-the-art [Formula: see text]. This improvement is attributed to the ability of GRU units to extract global information within the feature maps. Based on the conducted experiments, the proposed [Formula: see text] increases the Speech Transmission Index (STI), Perceptual Evaluation of Speech Quality (PESQ), and Short-Term Objective Intelligibility (STOI) scores for adversarial speech examples in speech enhancement.
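The abstract reports results in terms of word error rate (WER) but does not describe how the authors computed it. As a minimal sketch, assuming the standard definition WER = (S + D + I) / N (substitutions, deletions, and insertions over the number of reference words), the metric can be computed with a word-level edit distance. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition a lower WER indicates better recognition, and the value can exceed 1.0 when the hypothesis contains many inserted words.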

Publisher

World Scientific Pub Co Pte Ltd

Subject

Computer Graphics and Computer-Aided Design, Computer Science Applications, Computer Vision and Pattern Recognition

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0219467824500530