A survey of Polish ASR speech datasets

Published: 2024-03-01 Issue: 1 Volume: 60 Page: 27-52
ISSN: 0137-2459
Container-title: Poznan Studies in Contemporary Linguistics
Language: en

Author:

Junczyk Michał¹

Affiliation:

1. Adam Mickiewicz University , Poznań , Poland

Abstract

Access to speech datasets is essential for the effective use of modern ASR systems in low-resource languages like Polish. However, the lack of centralized information and metadata describing available datasets poses a significant challenge to researchers and practitioners. In this paper, we address this issue by presenting the most comprehensive survey of Polish ASR speech datasets to date. We manually curated information on 53 publicly available datasets and annotated them with 61 attributes, providing a comprehensive catalog of these resources. The catalog facilitates the discovery and evaluation of available datasets, enabling researchers to identify datasets that suit their specific needs. It also enables the identification of gaps in the existing datasets, which may inform future research directions. The catalog is open and community-driven, which means that new datasets can be added and issues can be reported, ensuring its continued relevance and usefulness to the ASR community. Our work contributes to improving the accessibility and usability of ASR systems in low-resource languages such as Polish.

Publisher

Walter de Gruyter GmbH

Link

https://www.degruyter.com/document/doi/10.1515/psicl-2023-0019/pdf

References: 32 articles.

1. Aksënova, Alëna, Zhehuai Chen, Chung-Cheng Chiu, Daan van Esch, Pavel Golik, Wei Han, Bhuvana Ramabhadran, Levi King, Andrew Rosenberg, Susan Schwartz & Gary Wang. 2022. Accented speech recognition: Benchmarking, pre-training, and diverse data. arXiv preprint arXiv:2205.08014.

2. Aksënova, Alëna, Daan van Esch, James Flynn & Pavel Golik. 2021. How might we create better benchmarks for speech recognition? In Proceedings of the 1st workshop on benchmarking: Past, present and future, 22–34.

3. Ardila, Rosana, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers & Gregor Weber. 2019. Common voice: A massively-multilingual speech corpus. In Proceedings of the twelfth language resources and evaluation conference, 4218–4222. European Language Resources Association.

4. Augustyniak, Łukasz, Kamil Tagowski, Albert Sawczyn, Denis Janiak, Roman Bartusiak, Adrian Szymczak, Marcin Wątroba, Arkadiusz Janz, Piotr Szymański, Mikołaj Morzy, Tomasz Kajdanowicz & Maciej Piasecki. 2022. This is the way: Designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish. arXiv preprint arXiv:2211.13112.

5. Bender, Emily M. & Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6. 587–604. https://doi.org/10.1162/tacl_a_00041.