Author:
Zhang Jing-Xuan,Richmond Korin,Ling Zhen-Hua,Dai Lirong
Abstract
This paper presents TaLNet, a model for voice reconstruction with ultrasound tongue and optical lip videos as inputs. TaLNet is based on an encoder-decoder architecture. Separate encoders are dedicated to processing the tongue and lip data streams respectively. The decoder predicts acoustic features conditioned on encoder outputs and speaker codes.To mitigate for having only relatively small amounts of dual articulatory-acoustic data available for training, and since our task here shares with text-to-speech (TTS) the common goal of speech generation, we propose a novel transfer learning strategy to exploit the much larger amounts of acoustic-only data available to train TTS models. For this, a Tacotron 2 TTS model is first trained, and then the parameters of its decoder are transferred to the TaLNet decoder. We have evaluated our approach on an unconstrained multi-speaker voice recovery task. Our results show the effectiveness of both the proposed model and the transfer learning strategy. Speech reconstructed using our proposed method significantly outperformed all baselines (DNN, BLSTM and without transfer learning) in terms of both naturalness and intelligibility. When using an ASR model decoding the recovery speech, the WER of our proposed method is relatively reduced over 30% compared to baselines.
Publisher
Association for the Advancement of Artificial Intelligence (AAAI)
Cited by
7 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献
1. Audio–visual deepfake detection using articulatory representation learning;Computer Vision and Image Understanding;2024-11
2. Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement;IEEE/ACM Transactions on Audio, Speech, and Language Processing;2024
3. Speech Reconstruction from Silent Tongue and Lip Articulation by Pseudo Target Generation and Domain Adversarial Training;ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);2023-06-04
4. Self-Supervised Audio-Visual Speech Representations Learning by Multimodal Self-Distillation;ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);2023-06-04
5. Speech Emotion Recognition via Heterogeneous Feature Learning;ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);2023-06-04