Abstract
This report describes the Corpus of German Speech (CoGS), a 56-million-word corpus of automatic speech recognition transcripts from YouTube channels of local government entities in Germany. Transcripts have been annotated with latitude and longitude coordinates, making the resource potentially useful for geospatial analyses of lexical, morpho-syntactic, and pragmatic variation; this is exemplified with an exploratory geospatial analysis of grammatical variation in the encoding of past temporal reference. Additional corpus metadata include video identifiers and timestamps on individual word tokens, making it possible to search for specific discourse content or utterance sequences in the corpus and download the underlying video and audio from the web, using open-source tools. The discourse content of the transcripts in CoGS touches upon a wide range of topics, making the resource potentially interesting as a data source for research in digital humanities and social science. The report also briefly discusses the permissibility of reuse of data sourced from German municipalities for corpus-building purposes in the context of EU, German, and American law, which clearly authorize such a use case.
Publisher
Springer Science and Business Media LLC
Subject
Library and Information Sciences, Linguistics and Language, Education, Language and Linguistics