Exploring the power of pure attention mechanisms in blind room parameter estimation-Reference-Cited by-同舟云学术

Exploring the power of pure attention mechanisms in blind room parameter estimation

Published:2024-04-24 Issue:1 Volume:2024 Page:
ISSN:1687-4722
Container-title:EURASIP Journal on Audio, Speech, and Music Processing
language:en
Short-container-title:J AUDIO SPEECH MUSIC PROC.

Author:

Wang Chunxi,Jia Maoshen^ORCID,Li Meiran,Bao Changchun,Jin Wenyu

Abstract

AbstractDynamic parameterization of acoustic environments has drawn widespread attention in the field of audio processing. Precise representation of local room acoustic characteristics is crucial when designing audio filters for various audio rendering applications. Key parameters in this context include reverberation time (RT

$$_{60}$$

60 ) and geometric room volume. In recent years, neural networks have been extensively applied in the task of blind room parameter estimation. However, there remains a question of whether pure attention mechanisms can achieve superior performance in this task. To address this issue, this study employs blind room parameter estimation based on monaural noisy speech signals. Various model architectures are investigated, including a proposed attention-based model. This model is a convolution-free Audio Spectrogram Transformer, utilizing patch splitting, attention mechanisms, and cross-modality transfer learning from a pretrained Vision Transformer. Experimental results suggest that the proposed attention mechanism-based model, relying purely on attention mechanisms without using convolution, exhibits significantly improved performance across various room parameter estimation tasks, especially with the help of dedicated pretraining and data augmentation schemes. Additionally, the model demonstrates more advantageous adaptability and robustness when handling variable-length audio inputs compared to existing methods.

Funder

Key Programme

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1186/s13636-024-00344-8.pdf

Reference52 articles.

1. Z. Zhang, J. Geiger, J. Pohjalainen, A.E.D. Mousa, W. Jin, B. Schuller, Deep learning for environmentally robust speech recognition: An overview of recent developments. ACM Trans. Intell. Syst. Technol. 9(5) (2018)

2. B. Wu, K. Li, F. Ge, Z. Huang, M. Yang, S.M. Siniscalchi, C.H. Lee, An end-to-end deep learning approach to simultaneous speech dereverberation and acoustic modeling for robust speech recognition. IEEE J. Sel. Top. Signal Process. 11(8), 1289–1300 (2017). https://doi.org/10.1109/JSTSP.2017.2756439

3. N. Mohammadiha, S. Doclo, Speech dereverberation using non-negative convolutive transfer function and spectro-temporal modeling. IEEE/ACM Trans. Audio Speech Lang. Process. 24(2), 276–289 (2016)

4. Stefania Cecchi, Alberto Carini, and Sascha Spors. Room Response Equal izationâĂŤA Review. Applied Sciences. 8, 1 (2018). https://doi.org/10.3390/app8010016

5. W. Jin, W.B. Kleijn, Theory and design of multizone soundfield reproduction using sparse methods. IEEE/ACM Trans. Audio Speech Lang. Process. 23(12), 2343–2355 (2015)