Abstract
Formants are among the main components of speaker identification systems, and the accuracy of formant determination underlies the efficiency of such systems. Improving existing speech recognition systems will significantly simplify human-computer interaction in situations where classic interfaces cannot be used, and will make such work more comfortable and efficient. Research on this topic is motivated by the unsatisfactory results of existing systems at low signal-to-noise ratios, the dependence of results on human operators, and the low speed of such systems. Four main formant trackers were compared with the proposed method: PRAAT, SNACK, ASSP, and DEEP. A number of studies have compared formant trackers, but none of them identifies a single tracker with the best efficiency. Formant extraction is complicated by the dynamic change of formants during speech, by closely spaced peaks in spectrograms, and by the difficulty of correctly locating the maxima of formant peaks on a spectrogram. A human can locate formants on spectrograms of a voice signal quite easily, but automating this process presents difficulties. Formant frequency extraction is proposed to be performed in several stages: a review of approaches to determining formant frequencies resulted in an algorithm consisting of nine stages. Segmentation of the voice signal into vocalized fragments and pauses is performed by estimating changes in fractal dimension. The spectrum of the voice signal is obtained using a complex Morlet wavelet based on the Gaussian window function. Each of the four trackers was configured with the default parameter set provided by its developers, and this single set of settings per tracker was used in the comparison. In the experiments, each tracker performed its own segmentation into vocalized fragments and pauses on the VTR-TIMIT dataset. The comparative analysis showed that the proposed method determines formant frequencies with fairly high accuracy compared with existing formant trackers.
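Two of the stages named above, obtaining the spectrum with a complex Morlet wavelet and locating formant peaks on it, can be illustrated with a short sketch. The abstract gives no implementation details, so everything below is an assumption rather than the paper's method: PyWavelets' 'cmor' complex Morlet wavelet with illustrative bandwidth and center-frequency parameters, prominence-based peak picking via scipy.signal.find_peaks, and the hypothetical helper names morlet_spectrum and formant_candidates.

    import numpy as np
    import pywt
    from scipy.signal import find_peaks

    def morlet_spectrum(frame, fs, fmin=100.0, fmax=4000.0, n_freqs=200):
        # Magnitude spectrum of one voiced frame via the complex Morlet CWT.
        # 'cmor1.5-1.0' is PyWavelets' complex Morlet (a Gaussian-windowed
        # complex exponential); the 1.5/1.0 parameters are illustrative only.
        freqs = np.linspace(fmin, fmax, n_freqs)
        center = pywt.central_frequency('cmor1.5-1.0')
        scales = center * fs / freqs  # map target frequencies to CWT scales
        coeffs, _ = pywt.cwt(frame, scales, 'cmor1.5-1.0',
                             sampling_period=1.0 / fs)
        return freqs, np.abs(coeffs).mean(axis=1)  # average |CWT| over time

    def formant_candidates(freqs, spectrum, n_formants=4, min_sep_hz=200.0):
        # Keep the most prominent, well-separated spectral peaks and report
        # them in ascending frequency order as formant candidates.
        df = freqs[1] - freqs[0]
        peaks, props = find_peaks(spectrum,
                                  distance=max(1, int(min_sep_hz / df)),
                                  prominence=0.05 * spectrum.max())
        best = np.argsort(props['prominences'])[::-1][:n_formants]
        return np.sort(freqs[peaks[best]])

    # Example: a 30 ms synthetic frame with three /a/-like spectral peaks.
    fs = 16000
    t = np.arange(int(0.03 * fs)) / fs
    frame = sum(np.sin(2 * np.pi * f * t) for f in (700, 1200, 2600))
    freqs, spec = morlet_spectrum(frame, fs)
    print(formant_candidates(freqs, spec))  # approx. [700, 1200, 2600] Hz

In the paper's actual pipeline these steps are preceded by segmentation into vocalized fragments and pauses via fractal-dimension estimation; the sketch skips that stage and operates on a single voiced frame.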
Publisher
Taras Shevchenko National University of Kyiv