Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition

Published: 2022-07-19 Issue: 14 Volume: 22 Page: 5381
ISSN: 1424-8220
Container-title: Sensors
Language: en
Short-container-title: Sensors

Author:

Lee Geon Woo, Kim Hong Kook

Abstract

This paper proposes a new two-step joint optimization approach, based on the asynchronous subregion optimization method, for training a pipeline model composed of two different models. The first step of the proposed approach trains the front-end model only, and the second step trains all the parameters of the combined model together. In the asynchronous subregion optimization method, the first step supports only the goal of the front-end model. In the proposed approach, by contrast, the first step uses a new loss function so that the front-end model also supports the goal of the back-end model. The proposed approach was applied to a pipeline with a deep complex convolutional recurrent network (DCCRN)-based speech enhancement model as the front-end and a conformer-transducer-based automatic speech recognition (ASR) model as the back-end. Its performance was evaluated on the LibriSpeech ASR corpus in noisy environments by measuring the character error rate (CER) and word error rate (WER). In addition, an ablation study was carried out to examine the effectiveness of the proposed approach on each of the processing blocks in the conformer-transducer ASR model. The ablation study showed that the conformer-transducer-based ASR model with the joint network trained only by the proposed optimization approach achieved the lowest average CER and WER. Compared to separate optimization of speech enhancement and ASR, the proposed approach reduced the average CER and WER on the Test-Noisy dataset under matched noise conditions by 0.30% and 0.48%, respectively. Compared to the conventional two-step joint optimization approach, it provided average CER and WER reductions of 0.22% and 0.31%, respectively. Under mismatched noise conditions, the proposed approach achieved a lower average CER and WER than the conventional optimization approach, by 0.32% and 0.43%, respectively.
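The two-step schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the scalar "models", the data, the learning rate, the weight `alpha`, and the form of the first-step loss (enhancement loss plus a weighted back-end loss) are all assumptions made for illustration. The abstract does not give the actual loss function, nor the loss used in the second step; the sketch assumes the back-end loss alone for step two.

```python
# Toy sketch of two-step joint optimization with an auxiliary loss.
# Hypothetical stand-ins: front-end e = a * x, back-end p = b * e.
# The real front-end (DCCRN) and back-end (conformer-transducer) are
# replaced by single scalars so the schedule itself is easy to follow.

# Each sample: (noisy input x, clean target c, back-end target y).
DATA = [(1.0, 0.5, 1.5), (2.0, 1.0, 3.0), (3.0, 1.5, 4.5)]
N = len(DATA)


def enhancement_loss(a):
    """Front-end goal: the enhanced signal a*x should match the clean c."""
    return sum((a * x - c) ** 2 for x, c, _ in DATA) / N


def asr_loss(a, b):
    """Back-end goal: the pipeline output b*a*x should match the label y."""
    return sum((b * a * x - y) ** 2 for x, _, y in DATA) / N


def step_one(a, b, alpha=0.5, lr=0.02, iters=200):
    """Step 1: update the front-end only; b is frozen.

    Assumed loss: enhancement_loss + alpha * asr_loss, so the front-end
    is also pulled toward the back-end's goal.
    """
    for _ in range(iters):
        grad_a = sum(
            2 * (a * x - c) * x + alpha * 2 * (b * a * x - y) * b * x
            for x, c, y in DATA
        ) / N
        a -= lr * grad_a
    return a


def step_two(a, b, lr=0.02, iters=200):
    """Step 2: update all parameters together (assumed loss: asr_loss)."""
    for _ in range(iters):
        grad_a = sum(2 * (b * a * x - y) * b * x for x, _, y in DATA) / N
        grad_b = sum(2 * (b * a * x - y) * a * x for x, _, y in DATA) / N
        a, b = a - lr * grad_a, b - lr * grad_b
    return a, b


def train_two_step(a=1.0, b=1.0):
    """Run both steps and return the parameters after each one."""
    a1 = step_one(a, b)
    a2, b2 = step_two(a1, b)
    return {"after_step_one": (a1, b), "after_step_two": (a2, b2)}
```

In this toy setting, the front-end parameter after step one settles between the value that minimizes the enhancement loss alone (0.5) and the value that minimizes the back-end loss alone with `b` frozen at 1.0 (1.5), which is the intended effect of adding a back-end term to the first-step loss. Setting `alpha=0` in `step_one` gives a first step that serves the front-end goal only.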

Funder

This work was conducted by the Center for Applied Research in Artificial Intelligence (CARAI) grant funded by DAPA and ADD

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/22/14/5381/pdf

References: 52 articles.

1. A Review of On-Device Fully Neural End-to-End Automatic Speech Recognition Algorithms

2. Recent Advances in End-to-End Automatic Speech Recognition

3. Supervised Speech Separation Based on Deep Learning: An Overview

4. An Experimental Analysis of Deep Learning Architectures for Supervised Speech Enhancement

5. Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition

Cited by 2 articles.

1. Cluster-Based Pairwise Contrastive Loss for Noise-Robust Speech Recognition;Sensors;2024-04-17

2. Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition;IEEE Access;2024