Affiliation:
1. Leibniz Institute for the German Language
Abstract
We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset is based on a corpus of 43.2 billion tokens and is divided into 16 parts, one for each of 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that in many use cases it is possible to work with only a subset of the folds (e.g., to save computational resources). In a case study, we investigate the growth of the vocabulary (as well as the number of hapax legomena) as more folds are included in the analysis. We combine this analysis with several cleaning stages of the dataset. We also provide guidance on working with the resource in the form of Python, R, and Stata markdown scripts.
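The vocabulary-growth case study can be illustrated with a minimal sketch. It assumes each fold is available as a mapping from a 1-gram to its frequency. The function name `vocabulary_growth` and the toy folds are hypothetical and do not reflect the actual DeReKoGram file format or the authors' scripts.

```python
from collections import Counter

def vocabulary_growth(folds):
    """Return (folds_included, vocabulary_size, hapax_count) after each fold is added.

    `folds` is an iterable of dicts mapping a 1-gram to its frequency in that fold
    (an assumed in-memory representation, not the dataset's real format).
    """
    cumulative = Counter()
    growth = []
    for k, fold in enumerate(folds, start=1):
        cumulative.update(fold)  # add this fold's frequencies to the running totals
        vocab_size = len(cumulative)  # number of distinct types seen so far
        # hapax legomena: types whose cumulative frequency is exactly 1
        hapaxes = sum(1 for freq in cumulative.values() if freq == 1)
        growth.append((k, vocab_size, hapaxes))
    return growth

# Toy data standing in for three folds (hypothetical values).
toy_folds = [
    {"haus": 3, "baum": 1},
    {"haus": 2, "katze": 1},
    {"baum": 1, "hund": 1},
]

for row in vocabulary_growth(toy_folds):
    print(row)
```

A type that is a hapax in one fold can stop being one once a later fold is added (here, "baum"), so the hapax count has to be recomputed from the cumulative frequencies rather than summed per fold.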
Publisher
Research Square Platform LLC