Selecting More Informative Training Sets with Fewer Observations

Published: 2023-06-08 Issue: 1 Volume: 32 Page: 133-139
ISSN: 1047-1987
Container-title: Political Analysis
Language: en
Short-container-title: Polit. Anal.

Author:

Kaufman, Aaron R.

Abstract

A standard text-as-data workflow in the social sciences involves identifying a set of documents to be labeled, selecting a random sample of them to label using research assistants, training a supervised learner to label the remaining documents, and validating that model’s performance using standard accuracy metrics. The most resource-intensive component of this is the hand-labeling: carefully reading documents, training research assistants, and paying human coders to label documents in duplicate or more. We show that hand-coding an algorithmically selected rather than a simple-random sample can improve model performance above baseline by as much as 50%, or reduce hand-coding costs by up to two-thirds, in applications predicting (1) U.S. executive-order significance and (2) financial sentiment on social media. We accompany this manuscript with open-source software to implement these tools, which we hope can make supervised learning cheaper and more accessible to researchers.

Publisher

Cambridge University Press (CUP)

Subject

Political Science and International Relations,Sociology and Political Science
