Multi-view Self-supervised Learning and Multi-scale Feature Fusion for Automatic Speech Recognition

Published:2024-05-08 Issue:3 Volume:56
ISSN:1573-773X
Container-title:Neural Processing Letters
language:en
Short-container-title:Neural Process Lett

Author:

Zhao Jingyu,Li Ruwei,Tian Maocun,An Weidong

Abstract

To address the poor representation capability and low data utilization rate of end-to-end speech recognition models in deep learning, this study proposes an end-to-end speech recognition model based on multi-scale feature fusion and multi-view self-supervised learning (MM-ASR). It adopts a multi-task learning paradigm for training. The proposed method emphasizes the importance of inter-layer information within shared encoders, aiming to enhance the model’s characterization capability via the multi-scale feature fusion module. Moreover, we apply multi-view self-supervised learning to effectively exploit data information. Our approach is evaluated on the Aishell-1 dataset, and its effectiveness is further validated on the English corpus WSJ. The experimental results demonstrate a noteworthy 4.6% reduction in character error rate, indicating significantly improved speech recognition performance. These findings showcase the effectiveness and potential of our proposed MM-ASR model for end-to-end speech recognition tasks.

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s11063-024-11614-z.pdf

References (64 articles)

1. Seltzer ML, Ju Y-C, Tashev I, Wang Y-Y, Yu D (2011) In-car media search. IEEE Signal Process Mag 28(4):50–60. https://doi.org/10.1109/MSP.2011.941065

2. Graves A, Mohamed A-R, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE international conference on acoustics, speech and signal processing, Vancouver, BC, Canada, pp 6645–6649. https://doi.org/10.1109/ICASSP.2013.6638947

3. Hinton G et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82–97. https://doi.org/10.1109/MSP.2012.2205597

4. Wang D, Wang X, Lv S (2019) An overview of end-to-end automatic speech recognition. Symmetry 11(8):1018

5. Li J (2022) Recent advances in end-to-end automatic speech recognition. APSIPA Trans Signal Inf Process 11(1)