The International Comparable Corpus: Challenges in building multilingual spoken and written comparable corpora

Published: 2021 Issue: 1 Volume: 10 Page: 89-103
ISSN: 2243-4712
Container-title: Research in Corpus Linguistics
Language: en
Short-container-title: RiCL

Author:

Čermáková Ann¹, Jantunen Jarmo², Jauhiainen Tommi³, Kirk John⁴, Křen Michal¹, Kupietz Marc⁵, Uí Dhonnchadha Elaine⁶

Affiliation:

1. Charles University

2. University of Jyväskylä

3. University of Helsinki

4. University of Vienna

5. Institut für Deutsche Sprache, Mannheim

6. Trinity College Dublin

Abstract

This paper reports on the efforts of twelve national teams in building the International Comparable Corpus (ICC; https://korpus.cz/icc), which will contain highly comparable datasets of spoken, written and electronic registers. The languages currently covered are Czech, Finnish, French, German, Irish, Italian, Norwegian, Polish, Slovak, Swedish and, more recently, Chinese, as well as English, which serves as the pivot language. The goal of the project is to provide much-needed data for contrastive corpus-based linguistics. The ICC is committed to re-using existing multilingual resources as much as possible, and its design is modelled, with various adjustments, on the International Corpus of English (ICE). As such, the ICC will contain approximately the same balance of 40 percent written language and 60 percent spoken language, distributed across 27 different text types and contexts. A number of issues encountered by the project teams are discussed, ranging from copyright and data sustainability to technical advances in data distribution.

Publisher

Research in Corpus Linguistics

References: 28 articles.

1. Aijmer, Karin and Bengt Altenberg eds. 2013. Advances in Corpus-based Contrastive Linguistics: Studies in Honour of Stig Johansson. Amsterdam: John Benjamins.

2. Bański, Piotr, Joachim Bingel, Nils Diewald, Elena Frick, Michael Hanl, Marc Kupietz, Piotr Pęzik, Carsten Schnober and Andreas Witt. 2013. KorAP: The new corpus analysis platform at IDS Mannheim. In Zygmunt Vetulani and Hans Uszkoreit eds. Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. Poznan: Uniwersytet im. Adama Mickiewicza w Poznaniu, 586–587.

3. Calzolari, Nicoletta, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asunción Moreno, Jan Odijk and Stelios Piperidis eds. 2016. Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016. Portorož: European Language Resources Association.

4. Cosma, Ruxandra and Marc Kupietz. 2019. On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. Revue Roumaine de Linguistique, 64/3. Editura Academiei Române.

5. Crystal, David. 2004. The Language Revolution. London: John Wiley & Sons.

Cited by 2 articles.

1. Structural and semantic features of adjectives across languages and registers;Languages in Contrast;2024-02-16

2. Other Applications of Comparable Corpora;Building and Using Comparable Corpora for Multilingual Natural Language Processing;2023