Abstract
Different speakers exhibit unique acoustic characteristics, which appear as distinct texture patterns in spectrograms. Existing research has focused primarily on speaker clustering and on the design of speaker verification network architectures, while the effect of applying different orders of differentiation to speaker characteristics has not been explored for speaker verification. In this study, we draw inspiration from the edge-oriented convolution block used for texture extraction in image super-resolution and propose a speaker verification model that integrates texture variation features through a new feature extraction module called order-FCM. We employ Sobel and Laplacian operators to compute first- and second-order differences in the time-frequency domain, so that the network can better learn the dynamic properties and higher-order features of speech signals. Additionally, we adopt the global response normalization method from computer vision to normalize global channel features at the end of order-FCM, mitigating potential feature collapse. We integrate the module into existing speaker verification networks, and experimental results show that the resulting method significantly outperforms baseline models on the validation sets provided by Vox1-O, Vox1-E, Vox1-H, and the 2023 International Speaker Recognition Challenge (Voxsrc2023).
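To make the two core operations concrete, the following is a minimal pure-Python sketch of Sobel and Laplacian filtering over a spectrogram followed by global response normalization. It is not the order-FCM module itself: the abstract does not specify the module's layout, so the axis convention (rows as frequency, columns as time), the stacking of the raw spectrogram with three filtered channels, the zero padding, the fixed `gamma` and `beta` values, and the function names `filter2d`, `grn`, and `texture_features` are all assumptions made for illustration. In a trained network the filters would typically sit inside convolution layers, and `gamma` and `beta` would be learnable.

```python
import math

# Standard 3x3 kernels. Rows index frequency and columns index time
# (an assumed layout; the abstract does not state one).
SOBEL_TIME = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # first-order difference along time
SOBEL_FREQ = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # first-order difference along frequency
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]       # second-order difference


def filter2d(spec, kernel):
    """Cross-correlate a 2D list with a 3x3 kernel, zero-padded to keep the shape."""
    rows, cols = len(spec), len(spec[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    r, c = i + di, j + dj
                    if 0 <= r < rows and 0 <= c < cols:
                        acc += kernel[di + 1][dj + 1] * spec[r][c]
            out[i][j] = acc
    return out


def grn(channels, gamma=1.0, beta=0.0, eps=1e-6):
    """Global response normalization over a list of 2D channels.

    Each channel is rescaled by its L2 norm relative to the mean norm across
    channels, with a residual connection. gamma and beta are fixed here
    (assumption); they are normally learnable parameters.
    """
    norms = [math.sqrt(sum(v * v for row in ch for v in row)) for ch in channels]
    mean_norm = sum(norms) / len(norms)
    scales = [n / (mean_norm + eps) for n in norms]
    return [
        [[gamma * v * s + beta + v for v in row] for row in ch]
        for ch, s in zip(channels, scales)
    ]


def texture_features(spec):
    """Stack the raw spectrogram with its filtered versions, then apply GRN."""
    channels = [
        spec,
        filter2d(spec, SOBEL_TIME),
        filter2d(spec, SOBEL_FREQ),
        filter2d(spec, LAPLACIAN),
    ]
    return grn(channels)
```

For a spectrogram whose energy rises linearly over time, the time-direction Sobel channel responds at interior points while the frequency-direction Sobel and Laplacian channels stay at zero there, which illustrates how each operator isolates a different order or direction of variation.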