Abstract
In text, images, merged surveys, voter files, and elsewhere, data sets are often missing important covariates, either because they are latent features of observations (such as sentiment in text) or because they are not collected (such as race in voter files). One promising approach for coping with this missing data is to find the true values of the missing covariates for a subset of the observations and then train a machine learning algorithm to predict the values of those covariates for the rest. However, plugging in these predictions without regard for prediction error renders regression analyses biased, inconsistent, and overconfident. We characterize the severity of the problem posed by prediction error, describe a procedure to avoid these inconsistencies under comparatively general assumptions, and demonstrate the performance of our estimators through simulations and a study of hostile political dialogue on the Internet. We provide software implementing our approach.
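The bias from plugging in predicted covariates can be seen in a small simulation. The sketch below is not the authors' procedure or software. It is a hypothetical illustration that assumes a binary covariate, a symmetric misclassification rate of 20% standing in for a classifier's prediction error, and a simple linear outcome model; all function names and parameter values are assumptions chosen for the example.

```python
import random


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


def simulate(n=20000, beta=2.0, error_rate=0.2, seed=0):
    """Compare regression slopes using the true vs. the predicted covariate.

    Assumed setup (illustrative only): x is Bernoulli(0.5),
    y = 1 + beta * x + standard normal noise, and the "prediction"
    flips x with probability error_rate, independently of y.
    """
    rng = random.Random(seed)
    x_true = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    y = [1.0 + beta * x + rng.gauss(0.0, 1.0) for x in x_true]
    # Stand-in for a classifier's output: the true label with random errors.
    x_pred = [1 - x if rng.random() < error_rate else x for x in x_true]
    return ols_slope(x_true, y), ols_slope(x_pred, y)


slope_true, slope_plugin = simulate()
print(f"slope using true covariate:      {slope_true:.2f}")
print(f"slope using predicted covariate: {slope_plugin:.2f}")
```

Under these assumptions the plug-in slope converges to beta * (1 - 2 * error_rate) = 1.2 rather than the true value of 2.0, and collecting more observations does not remove the gap. This is the sense in which the plug-in estimator is inconsistent and not merely noisy.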
Publisher
Cambridge University Press (CUP)
Subject
Political Science and International Relations, Sociology and Political Science
Cited by
16 articles.