Affiliation:
1. UC Arts Digital Lab, University of Canterbury, Christchurch, New Zealand
2. New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, Christchurch, New Zealand
Abstract
The availability of large digital archives of historical newspaper content has transformed the historical sciences. However, the scale of these archives can limit the direct application of advanced text processing methods. Even if it is computationally feasible to apply sophisticated language processing to an entire digital archive, if the material of interest is a small fraction of the archive, the results are unlikely to be useful. Methods for generating smaller specialized corpora from large archives are required to solve this problem. This article presents such a method for historical newspaper archives digitized using the METS/ALTO XML standard (Veridian Software, n.d.). The method is an ‘iterative bootstrapping’ approach in which candidate corpora are evaluated using text mining techniques, items are manually labelled, and Naïve Bayes text classifiers are trained and applied in order to produce new candidate corpora. The method is illustrated by a case study that investigates philosophical content, broadly construed, in pre-1900 English-language New Zealand newspapers. Extensive code is provided in Supplementary Materials.
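The iterative bootstrapping loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' supplementary code: the function names (`tokenize`, `train_nb`, `log_odds`, `bootstrap`), the `label_fn` callback that stands in for manual labelling, the whitespace tokenizer, and the add-one smoothing are all assumptions made for illustration. The sketch also assumes the seed labels contain at least one relevant and one irrelevant item.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens; a real pipeline would do more.
    return text.lower().split()


def train_nb(labelled):
    """Fit a two-class multinomial Naive Bayes model.

    labelled: list of (text, is_relevant) pairs, with both classes present.
    """
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labelled:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def log_odds(model, text):
    """Log-odds that text is relevant; positive means 'relevant'."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing over the vocabulary.
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return scores[True] - scores[False]


def bootstrap(archive, seed_labels, label_fn, rounds=2, batch=2):
    """Alternate between classifying the archive and labelling new candidates.

    archive:     list of item texts
    seed_labels: dict mapping item index -> bool (initial manual labels)
    label_fn:    hypothetical stand-in for a human labelling an item
    """
    labelled = dict(seed_labels)
    for _ in range(rounds):
        model = train_nb([(archive[i], y) for i, y in labelled.items()])
        # Candidate corpus: every item the current classifier accepts.
        candidates = [
            i for i in range(len(archive)) if log_odds(model, archive[i]) > 0
        ]
        # "Manually" label a batch of candidates not yet labelled.
        for i in [c for c in candidates if c not in labelled][:batch]:
            labelled[i] = label_fn(archive[i])
    # Final classifier trained on all labels gathered so far.
    model = train_nb([(archive[i], y) for i, y in labelled.items()])
    return sorted(
        i for i in range(len(archive)) if log_odds(model, archive[i]) > 0
    )
```

In this sketch each round retrains the classifier on all labels collected so far, so false positives in an early candidate corpus can be corrected once they are labelled in a later round. The evaluation of candidate corpora with text mining techniques, which the abstract also mentions, is omitted here.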
Publisher
Oxford University Press (OUP)
Subject
Computer Science Applications, Linguistics and Language, Language and Linguistics, Information Systems
Reference: 57 articles.
1. Identifying virtues and values through obituary data-mining; Alfano; Journal of Value Inquiry, 2018
2. Using Corpora in Discourse Analysis
3. Reading the newspaper in Colonial Otago; Ballantyne; The Journal of New Zealand Studies, 2012
4. Corpus construction: a principle for qualitative data collection; Bauer; Qualitative Researching with Text, Image and Sound: A Practical Handbook, 2000