Improving Tone Recognition Performance using Wav2vec 2.0-Based Learned Representation in Yoruba, a Low-Resourced Language

Authors:

Bengono Obiang Saint Germes B. (1,2), Tsopze Norbert (3,2), Melatagia Yonta Paulin (4,2), Bonastre Jean-Francois (5,6), Jiménez Tania (5)

Affiliation:

1. University of Yaounde I, Yaounde, Cameroon

2. Sorbonne Université, IRD, UMMISCO, F-93143 Bondy, France

3. Faculty of Sciences, Computer Science, University of Yaounde I, Yaounde, Cameroon

4. Computer Science, Faculty of Sciences, University of Yaounde I, Yaounde, Cameroon

5. Avignon Université, Laboratoire Informatique d'Avignon, Avignon, France

6. Defense and Security Dept., Inria, Paris, France

Abstract

Many sub-Saharan African languages are tone languages, and most are classified as low-resource languages because of the limited resources and tools available to process them. Identifying the tone associated with a syllable is therefore a key challenge for speech recognition in these languages. We propose models that automate tone recognition in continuous speech and that can easily be incorporated into a speech recognition pipeline for these languages. We investigated several neural architectures as well as several speech feature extraction algorithms (filter banks, LEAF, cepstrogram, MFCC). Given the low-resource setting, we also evaluated Wav2vec 2.0 models for this task. In this work, we use a public Yoruba speech recognition dataset. Using the combination of features obtained from the cepstrogram (CS) and filter banks (FB), we obtain a minimum TER (Tone Error Rate) of 19.54%, while the models based on Wav2vec 2.0 reach a TER of 17.72%, demonstrating that Wav2vec 2.0 representations outperform the models used in the literature for tone identification in low-resource languages.
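To illustrate the kind of pipeline the abstract describes, the sketch below extracts both classical frame-level features (log-mel filter banks, MFCCs) and Wav2vec 2.0 contextual representations from a single utterance. It is a minimal sketch under stated assumptions, not the authors' implementation: the checkpoint name "facebook/wav2vec2-base", the file "utterance.wav", the 16 kHz mono input, and the torchaudio/transformers calls are illustrative choices, and the cepstrogram and LEAF front-ends used in the paper are not shown.

```python
# Illustrative sketch only: feature extraction for a syllable-level tone classifier.
# Assumed (not from the paper): checkpoint "facebook/wav2vec2-base",
# input file "utterance.wav", 16 kHz mono audio.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

waveform, sr = torchaudio.load("utterance.wav")             # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)               # force mono
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

# Classical hand-crafted features: 40-dim log-mel filter banks and 13-dim MFCCs.
fbank = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=40)        # (frames, 40)
mfcc = torchaudio.transforms.MFCC(sample_rate=16_000, n_mfcc=13)(waveform)  # (1, 13, frames)

# Learned representation: Wav2vec 2.0 contextual vectors, roughly one 768-dim
# vector every 20 ms, used in place of (or alongside) the features above.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
inputs = extractor(waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    w2v_feats = w2v(inputs.input_values).last_hidden_state  # (1, frames, 768)

# Any of these frame-level sequences can be fed to a sequence model (e.g. a CNN
# or BiLSTM) that predicts one tone label per syllable and is scored with TER.
```

In such a setup, TER would be computed like a word error rate, i.e. an edit distance between the predicted and reference tone sequences normalized by the reference length; the exact scoring protocol used in the paper may differ.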

Publisher

Association for Computing Machinery (ACM)

References (31 articles)

1. Oliver Adams, Trevor Cohn, Graham Neubig, and Alexis Michaud. 2017. Phonemic Transcription of Low-Resource Tonal Languages. In Proceedings of the Australasian Language Technology Association Workshop 2017. Brisbane, Australia, 53–60. https://aclanthology.org/U17-1006

2. Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol.  33. Curran Associates, Inc., 12449–12460. https://proceedings.neurips.cc/paper_files/paper/2020/file/92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf

3. Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. 2017. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In O-COCOSDA 2017.

4. Malgorzata Ćavar, Damir Ćavar, and Hilaria Cruz. 2016. Endangered Language Documentation: Bootstrapping a Chatino Speech Corpus, Forced Aligner, ASR. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). European Language Resources Association (ELRA), Portorož, Slovenia, 4004–4011. https://aclanthology.org/L16-1632

5. Charles Chen, Razvan C. Bunescu, Li Xu, and Chang Liu. 2016. Tone Classification in Mandarin Chinese Using Convolutional Neural Networks. In Interspeech 2016.
