Affiliation:
1. KU Leuven
2. KU Leuven & Radboud University
Abstract
In this article, we present a new corpus spanning 163 years of written Dutch. This Dutch Corpus of Contemporary and late Modern Periodicals (Dutch C-CLAMP) comprises 47,738 part-of-speech-tagged articles published in Dutch periodicals between 1837 and 1999, totaling approximately 200 million tokens. We explain the measures we took to overcome the shortcomings of existing corpora of historical Dutch covering the same period, and we describe in detail how the corpus was compiled and enriched. The description covers text markup, preprocessing of the data (including foreign-language recognition and spelling normalization), and the enrichment of both the textual data and the metadata on the authors of the corpus files. We also carry out two case studies to illustrate the reliability of the corpus.
Publisher
Amsterdam University Press
Subject
General Earth and Planetary Sciences, General Environmental Science