Abstract
Speech synthesis has made significant progress in recent years thanks to deep neural networks (DNNs). However, one of the challenges of DNN-based models is their need for large and diverse training data, which limits their applicability to many languages and domains. To date, no multi-speaker text-to-speech (TTS) dataset has been available for Persian, which hinders the development of such models for this language. In this paper, we present a novel dataset for multi-speaker TTS in Persian, consisting of 120 hours of high-quality speech from 67 speakers. We use this dataset to train two synthesizers and a vocoder and evaluate the quality of the synthesized speech. The naturalness of the generated samples, measured by the mean opinion score (MOS), is 3.94 and 4.12 for the two trained multi-speaker synthesizers. These results indicate that the dataset is suitable for training multi-speaker TTS models and can facilitate future research in this area for Persian.