Developing computational infrastructure for the CorCenCC corpus: The National Corpus of Contemporary Welsh

Published:2020-07-29 Issue:3 Volume:55 Page:789-816
ISSN:1574-020X
Container-title:Language Resources and Evaluation
language:en
Short-container-title:Lang Resources & Evaluation

Author:

Knight Dawn,Loizides Fernando,Neale Steven,Anthony Laurence,Spasić Irena

Abstract

CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes, the National Corpus of Contemporary Welsh) is the first comprehensive corpus of Welsh designed to be reflective of language use across communication types, genres, speakers, language varieties (regional and social) and contexts. This article focuses on the computational infrastructure that we have designed to support data collection for CorCenCC, and the subsequent uses of the corpus, which include lexicography, pedagogical research and corpus analysis. A grass-roots approach to design has been adopted, one that has adapted and extended previous corpus-building and introduced new features as required for this specific context and language. The key pillars of the infrastructure include a framework that supports metadata collection, an innovative mobile application designed to collect spoken data (utilising a crowdsourcing approach), a backend database that stores curated data and a web-based interface that allows users to query the data online. A usability study was conducted to evaluate the user-facing tools and to suggest directions for future improvements. Though the infrastructure was developed for Welsh language collection, its design can be re-used to support corpus development in other minority or major language contexts, broadening the potential utility and impact of this work.

Funder

Economic and Social Research Council

Publisher

Springer Science and Business Media LLC

Subject

Library and Information Sciences,Linguistics and Language,Education,Language and Linguistics

Link

https://link.springer.com/content/pdf/10.1007/s10579-020-09501-9.pdf

Reference: 51 articles.

1. Adolphs, S., Knight, D., Smith, C., & Price, D. (2020). Crowdsourcing formulaic phrases: towards a new type of spoken corpus. Corpora, 15(1), in press.

2. Anthony, L. (2014). AntConc (Version 3.4.3). Waseda University. https://www.laurenceanthony.net/software/antconc/. Accessed 27 July 2020.

3. Areta, N., Gurrutxaga, A., Leturia, I., Alegria, I., Artola, X., Diaz de Ilarraza, A., et al. (2007). ZT corpus: Annotation and tools for Basque corpora. Paper presented at the Corpus Linguistics Conference, Birmingham.

4. Aston, G., & Burnard, L. (1998). The BNC handbook: exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.

5. Atkins, S., Clear, J., & Ostler, N. (1992). Corpus design criteria. Literary and Linguistic Computing, 7(1), 1–16.

Cited by 9 articles.

1. English Syntactic Analysis and Word Sense Disambiguation Strategy of Neutral Set from the Perspective of Natural Language Processing;Advances in Multimedia;2022-08-08

2. Coordinated Development of Languages in Multiethnic Inhabited Areas by Big Data Algorithm;Mobile Information Systems;2022-07-14

3. Design of Chinese Corpus Based on Semantic Mining Algorithm;Advances in Multimedia;2022-07-05

4. Corpora in Applied Linguistics;CAM APPL L;2022-04-21

5. English–Welsh Cross-Lingual Embeddings;Applied Sciences;2021-07-16