Abstract
Purpose
This study concerns the generation of metadata to enhance thematic transparency and facilitate research on interview collections at the Research Documentation Centre, Centre for Social Sciences (TK KDK) in Budapest. It explores the use of artificial intelligence (AI) in producing, managing and processing social science data, and its potential to generate useful metadata that describe the contents of such archives on a large scale.
Design/methodology/approach
The authors combined manual and automated/semi-automated methods of metadata development and curation. They developed a domain-oriented taxonomy to classify a large text corpus of semi-structured interviews. To this end, they adapted the European Language Social Science Thesaurus (ELSST) to produce a concise, hierarchical structure of topics relevant to the social sciences. They identified and tested the most promising natural language processing (NLP) tools that support the Hungarian language. The results of manual and machine coding will be presented in a user interface.
Findings
The study describes how an international social scientific taxonomy can be adapted to a specific local setting and tailored for use by automated NLP tools. The authors show the potential and limitations of existing and new NLP methods for thematic assignment. They discuss the current possibilities of multi-label classification in social scientific metadata assignment, i.e. the problem of automatically selecting relevant labels from a large pool.
Originality/value
Interview materials have not yet been used to build manually annotated training datasets for the automated indexing of scientifically relevant topics in a data repository. Comparing various automated-indexing methods, this study shows a possible implementation of a researcher tool that supports custom visualizations and faceted search of interview collections.
Subject
Library and Information Sciences, Information Systems