Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

Published:2024-01-10 Issue:2 Volume:13 Page:307
ISSN:2079-9292
Container-title:Electronics
language:en
Short-container-title:Electronics

Author:

Wang Jian-Hong¹, Lai Yen-Ting², Tai Tzu-Chiang³, Le Phuong Thi⁴, Pham Tuan⁵, Wang Ze-Yu¹, Li Yung-Hui⁶, Wang Jia-Ching⁷, Chang Pao-Chi²

Affiliation:

1. School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China

2. Department of Communication Engineering, National Central University, Taoyuan 32001, Taiwan

3. Department of Computer Science and Information Engineering, Providence University, Taichung 43301, Taiwan

4. Department of Computer Science and Information Engineering, Fu Jen Catholic University, New Taipei City 24205, Taiwan

5. Faculty of Digital Technology, The University of Danang—University of Technology and Education, Danang 550000, Vietnam

6. AI Research Center, Hon Hai Research Institute, New Taipei City 236, Taiwan

7. Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan

Abstract

When recording conversations, there may be multiple people talking at once. While our human ears can filter out unwanted sounds, this can be challenging for automatic speech recognition (ASR) systems, leading to reduced accuracy. To address this issue, preprocessing mechanisms such as speech separation and targeted speaker extraction are necessary to separate each person’s speech. With the development of deep learning, the quality of separated speech has improved significantly. Our objective is to focus on speaker extraction, which entails implementing a primary system for speech extraction and a secondary subsystem for delivering target information. To accomplish this, we have chosen a temporal convolutional network (TCN) architecture as the foundation of our speech extraction model. A TCN enables convolutional neural networks (CNNs) to manage time series modeling, and it can be constructed in various model lengths. Furthermore, we have integrated attention enhancement into the secondary subsystem to provide the speech extraction model with comprehensive and effective target information, which helps to improve the model’s ability to estimate masks. As a result, the quality of the target speaker extraction will be greatly enhanced with a more precise mask.
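The abstract notes that a TCN can be built in various lengths and that extraction relies on an estimated mask. The sketch below illustrates both ideas in isolation. It is not the authors' implementation: the function names, the kernel size, the doubling dilation schedule (1, 2, 4, ...), and the one-convolution-per-layer layout are assumptions chosen for illustration, and the abstract does not specify any of them.

```python
from typing import List


def tcn_receptive_field(kernel_size: int, num_layers: int, num_stacks: int = 1) -> int:
    """Receptive field, in frames, of stacked dilated 1-D convolutions.

    Assumption (not from the paper): one convolution per layer, with the
    dilation doubling at each layer (1, 2, 4, ...) and restarting at 1 in
    every stack.
    """
    if kernel_size < 1 or num_layers < 1 or num_stacks < 1:
        raise ValueError("kernel_size, num_layers and num_stacks must be >= 1")
    dilations = [2 ** i for i in range(num_layers)]
    # Each layer widens the receptive field by (kernel_size - 1) * dilation.
    per_stack = sum((kernel_size - 1) * d for d in dilations)
    return 1 + num_stacks * per_stack


def apply_mask(mixture: List[float], mask: List[float]) -> List[float]:
    """Element-wise masking: scale each mixture feature by its mask value."""
    if len(mixture) != len(mask):
        raise ValueError("mixture and mask must have the same length")
    return [x * m for x, m in zip(mixture, mask)]


if __name__ == "__main__":
    # A longer model (more stacks) sees more temporal context.
    print(tcn_receptive_field(kernel_size=3, num_layers=8, num_stacks=1))
    print(tcn_receptive_field(kernel_size=3, num_layers=8, num_stacks=3))
    # A mask near 1 keeps a feature; a mask near 0 suppresses it.
    print(apply_mask([1.0, -2.0, 0.5], [1.0, 0.5, 0.0]))
```

Under these assumptions, adding stacks or layers lengthens the temporal context available at each frame, which is one way to read "various model lengths"; a more accurate mask then translates directly into a cleaner masked output.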

Funder

National Science and Technology Council of Taiwan

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering

Link

https://www.mdpi.com/2079-9292/13/2/307/pdf

Reference: 34 articles.

1. Some experiments on the recognition of speech, with one and with two ears;Cherry;J. Acoust. Soc. Am.,1953

2. Two-microphone separation of speech mixtures;Pedersen;IEEE Trans. Neural Netw.,2008

3. Bartelds, M., San, N., McDonnell, B., Jurafsky, D., and Wieling, M. (2023). Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation. arXiv.

4. Recent advances in end-to-end automatic speech recognition;Li;APSIPA Trans. Signal Inf. Process.,2022

5. Towards inclusive automatic speech recognition;Feng;Comput. Speech Lang.,2024