Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition

Published:2019-10-31 Issue:21 Volume:9 Page:4639
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Wu Long, Li Ta, Wang Li, Yan Yonghong

Abstract

As demonstrated by the hybrid connectionist temporal classification (CTC)/Attention architecture, joint training with a CTC objective is very effective at solving the misalignment problem in the attention-based end-to-end automatic speech recognition (ASR) framework. However, the CTC output relies only on the current input, which leads to a hard alignment issue. To address this problem, this paper proposes the time-restricted attention CTC/Attention architecture, which integrates an attention mechanism into the CTC branch. “Time-restricted” means that the attention mechanism operates on a limited window of frames to the left and right of the current frame. In this study, we first explore time-restricted location-aware attention CTC/Attention to establish a proper time-restricted attention window size. Inspired by the success of self-attention in machine translation, we further introduce the time-restricted self-attention CTC/Attention, which can better model the long-range dependencies among the frames. Experiments on the Wall Street Journal (WSJ), Augmented Multi-party Interaction (AMI), and Switchboard (SWBD) tasks demonstrate the effectiveness of the proposed time-restricted self-attention CTC/Attention. Finally, to explore the robustness of this method to noise and reverberation, we jointly train a neural beamformer frontend with the time-restricted attention CTC/Attention ASR backend on the CHiME-4 dataset. The reduction in word error rate (WER) and the increase in perceptual evaluation of speech quality (PESQ) confirm the effectiveness of this framework.
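
The abstract describes “time-restricted” attention as attention computed over a limited window of frames to the left and right. The following is a minimal sketch of that idea only, not the paper's model: it assumes single-head scaled dot-product self-attention with no learned projections, and the function name and the `left`/`right` window parameters are hypothetical names chosen for illustration.

```python
import numpy as np

def time_restricted_self_attention(x, left=2, right=2):
    """Illustrative sketch: each frame attends only to frames in [t - left, t + right].

    x: array of shape (T, d) holding T frame feature vectors.
    Returns (context, weights) with shapes (T, d) and (T, T).
    """
    T, d = x.shape
    # Scaled dot-product scores between every pair of frames.
    scores = (x @ x.T) / np.sqrt(d)
    # offset[t, s] = s - t, the position of frame s relative to frame t.
    idx = np.arange(T)
    offset = idx[None, :] - idx[:, None]
    # Keep only frames inside the restricted window around each frame.
    mask = (offset >= -left) & (offset <= right)
    scores = np.where(mask, scores, -np.inf)
    # Row-wise softmax; masked positions receive exactly zero weight.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights
```

With `left=1, right=2`, for example, frame 3 would draw its context only from frames 2 through 5; the window sizes here are placeholders, not the values the paper settles on.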

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science

Link

https://www.mdpi.com/2076-3417/9/21/4639/pdf

Reference: 31 articles.

1. Recent progresses in deep learning based acoustic models

2. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding

3. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

4. A Comparison of Sequence-to-Sequence Models for Speech Recognition

5. Exploring neural transducers for end-to-end speech recognition

Cited by 10 articles.

1. The FawAI ASR System for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge;2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP);2022-12-11

2. 3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition;2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP);2022-12-11

3. Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition;Sensors;2022-09-27

4. Hybrid end-to-end model for Kazakh speech recognition;International Journal of Speech Technology;2022-08-02

5. Cursive Text Recognition in Natural Scene Images Using Deep Convolutional Recurrent Neural Network;IEEE Access;2022