Affiliation:
1. Department of Electricity and Electronics, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), Barrio Sarriena, 48940 Leioa, Spain
Abstract
The development of speech technology requires large amounts of data to estimate the underlying models. Even when relying on large multilingual pre-trained models, some task-specific data in the target language is still needed to fine-tune those models and obtain competitive performance. In this paper, we present a bilingual Basque–Spanish dataset extracted from parliamentary sessions. The dataset is designed for developing and evaluating automatic speech recognition (ASR) systems but can be easily repurposed for other speech-processing tasks (such as speaker or language recognition). The paper first compares the two target languages, emphasizing their similarities at the acoustic-phonetic level, which provide the basis for sharing data and compensating for the relatively small amount of spoken resources available for Basque. Then, Basque Parliament plenary sessions are characterized in terms of organization, topics, speaker turns and the use of the two languages. The paper continues with a description of the data collection procedure (involving both speech and text), the audio formats and conversions, and the creation and postprocessing of text transcriptions and session minutes. It then describes the semi-supervised iterative procedure used to cut, rank and select the training segments, and the manual supervision process employed to produce the test set. Finally, ASR experiments using state-of-the-art technology are presented to validate the dataset and to set a reference for future work. The datasets, along with models and recipes to reproduce the experiments reported in the paper, are released through Hugging Face.
Funder
Ministerio de Ciencia e Innovación
Basque Government