Abstract
Emerging linguistic problems are data-driven and multidisciplinary, and they require richly transcribed corpora. Accurate corpus transcription therefore demands intelligent protocols that satisfy three criteria: 1) acceptability to end users and to computers/machines; 2) conformity to existing language standards, rules, and structures; and 3) representation within the context of the intended language domain. To demonstrate the feasibility of these criteria, a template-based framework for multilingual transcription was proposed and implemented. The first version of the resulting transcription tool, named SCAnnAL (Speech Corpus Annotator for African Languages), applies signal processing to pre-segment the waveforms of a recorded speech corpus into word, syllable, and phoneme units, producing a pre-segmented TextGrid file with empty labels. Using preformatted templates, the front-end, or linguistic, datasets (the text corpus, the vowel inventory, the consonant inventory, and a set of syllabification rules) are specified in a default language. A Natural Language Understanding (NLU) algorithm then combines these datasets with a data-driven syllabification algorithm to relabel subtrees of the TextGrid file. Finally, tone pattern models were constructed from translations of experimental data, using the Ibadan 400 word list (a list of basic vocabulary items of a language), for four Nigerian tone languages. Integration of the tone pattern models into the transcription system is planned for a future paper. This research will benefit emerging digital humanists and computational linguists working on language data, and it opens new opportunities for improved speech processing systems for African tone languages.
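The relabelling step described above hinges on syllabification driven by the template-supplied inventories. The following Python sketch illustrates one common data-driven approach (nucleus detection with consonants attached as onsets of the following syllable); the inventories, the open-CV syllable assumption, and the `syllabify` function are illustrative assumptions, not SCAnnAL's actual implementation.

```python
# Minimal sketch of template-driven syllabification: the vowel and consonant
# inventories below are hypothetical (loosely Yoruba-like); in the framework
# described above they would come from the preformatted templates.

VOWELS = set("aeiou") | {"ẹ", "ọ"}                         # hypothetical inventory
CONSONANTS = set("bdfghjklmnprstwy") | {"gb", "kp", "ṣ"}   # includes digraphs

def syllabify(word: str) -> list[str]:
    """Split a word into syllables: each vowel nucleus closes a syllable,
    so intervocalic consonants become onsets of the following syllable.
    Assumes open CV(V) syllables, a simplification for illustration."""
    # Tokenize into phoneme-like units, preferring digraphs such as 'gb'.
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2] in CONSONANTS:
            units.append(word[i:i + 2])
            i += 2
        else:
            units.append(word[i])
            i += 1
    # Group units into syllables, closing each syllable at a vowel nucleus.
    syllables, current = [], []
    for u in units:
        current.append(u)
        if u in VOWELS:
            syllables.append("".join(current))
            current = []
    if current:  # trailing consonants attach to the last syllable
        if syllables:
            syllables[-1] += "".join(current)
        else:
            syllables.append("".join(current))
    return syllables

if __name__ == "__main__":
    for w in ["baba", "ẹkọ", "agbado"]:
        print(w, "->", syllabify(w))   # e.g. agbado -> ['a', 'gba', 'do']
```

In a pipeline of the kind sketched in the abstract, labels produced this way would fill the empty intervals of the pre-segmented TextGrid's syllable tier, with the phoneme tier labelled from the same unit tokenization.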
Publisher
Edinburgh University Press
Subject
Human-Computer Interaction, General Arts and Humanities, General Computer Science