Abstract
This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and extends that work by building and evaluating equivalent systems under the closed-data conditions of the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds, and achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. Of these, we highlight c2-streaming_600ms_t, which followed a configuration similar to the primary system's but with a smaller context window of 0.6 s; it achieved 16.9% WER on the same test set, with a measured empirical latency of 0.81 ± 0.09 s (mean ± stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning at a small WER degradation of 6% relative. As an extension, the equivalent closed-condition systems obtained 23.3% and 23.5% WER, respectively. When evaluated with an unconstrained language model, they obtained 19.9% and 20.4% WER; i.e., not far behind the top-performing systems despite using only 5% of the full acoustic data, and with the additional advantage of being streaming-capable. Indeed, all of these streaming systems could be deployed in production environments for automatic captioning of live media streams.
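The "6% relative" figure follows from the two WER values quoted above. A minimal sketch of that computation (the function name is ours; the WER values are taken from the abstract):

```python
def relative_degradation(wer_baseline: float, wer_system: float) -> float:
    """Relative WER degradation of a system w.r.t. a baseline, in percent."""
    return 100.0 * (wer_system - wer_baseline) / wer_baseline

# Primary system (1.5 s context window) vs. contrastive c2 (0.6 s) on test-2020:
delta = relative_degradation(16.0, 16.9)
print(f"{delta:.1f}% relative")  # 5.6% relative, reported as ~6% in the abstract
```
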
Funder
European Union
Erasmus+ Education
Generalitat Valenciana
Universitat Politècnica de València
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science