Engaging with bad (meta)data in historical corpus linguistics

Published:2024-09-15 Page:9-34
ISSN:1388-0373
Container-title:Studies in Corpus Linguistics
language:en

Author:

Vartiainen Turo¹, Säily Tanja¹

Affiliation:

1. University of Helsinki

Abstract

In this chapter, we discuss some common pitfalls related to historical data and its use in linguistic analysis. We argue that the “philologist’s dilemma”, as originally proposed by Rissanen (1989), should be reconceptualized to meet the needs of the fast-evolving field of corpus linguistics, where scholars make increasing use of big-data resources and sophisticated statistical modelling. By providing examples of errors and uncertainties related to, for example, corpus metadata, sampling, balance, and OCR accuracy, we argue that corpus linguists should pay increasingly close attention to the sampling and annotation principles employed in the compilation of historical corpora as well as to the quality of the linguistic data. We propose that the principle of “knowing one’s corpus” in terms of its compilation principles has become all the more important in the age of big-data corpora, where it is not feasible for individual researchers, or corpus compilers, to validate their data manually.

Publisher

John Benjamins Publishing Company

References (41 articles)

1. CCOHA: Clean Corpus of Historical American English;Alatrash,2020

2. On frequency, transparency and productivity

3. Diachronic relations among speech-based and written registers in English;Biber,1997

4. The historical shift of scientific academic prose in English towards less explicit styles of expression