Abstract
Studies comparing acoustic signals often rely on pixel-wise differences between spectrograms, such as mean squared error (MSE). Pixel-wise errors are not representative of perceptual sensitivity, however, and such measures can be highly sensitive to small local signal changes that may be imperceptible. In computer vision, high-level visual features extracted with convolutional neural networks (CNNs) can be used to calculate the fidelity of computer-generated images. Here, we propose the auditory perceptual distance (APD) metric based on acoustic features extracted with an unsupervised CNN and validated by perceptual behavior. Using complex vocal signals from songbirds, we trained a Siamese CNN on a self-supervised task using spectrograms rescaled to match the auditory frequency sensitivity of European starlings, Sturnus vulgaris. We define APD for any pair of sounds as the cosine distance between the corresponding feature vectors extracted by the trained CNN. We show that APD is more robust to temporal and spectral translation than MSE, and captures the sigmoidal shape of typical behavioral psychometric functions over complex acoustic spaces. When fine-tuned using starlings’ behavioral judgments of naturalistic song syllables, the APD model yields even more accurate predictions of perceptual sensitivity, discrimination, and categorization on novel complex (high-dimensional) acoustic dimensions, including diverging decisions for identical stimuli following different training conditions. Thus, the APD model outperforms MSE in robustness and perceptual accuracy, and offers tunability to match experience-dependent perceptual biases.
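For concreteness, the cosine-distance definition stated above can be written as follows; this is a minimal sketch in which f denotes the feature embedding produced by the trained CNN for a sound's spectrogram (the symbol f is introduced here for illustration and is not notation from the paper):

\[
\mathrm{APD}(x, y) \;=\; 1 \;-\; \frac{f(x) \cdot f(y)}{\lVert f(x) \rVert \, \lVert f(y) \rVert}
\]

Under this formulation, APD ranges from 0 (identical feature directions) toward larger values as the embeddings of the two sounds diverge, independent of their overall magnitudes.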
Publisher
Cold Spring Harbor Laboratory