Affiliation:
1. Sogeti Luxembourg, Bertrange, Luxembourg
Abstract
I propose a solution to content removal bias in statistics derived from web-scraped data. Content removal bias occurs when data is removed from the web before a scraper is able to collect it. The solution is based on inverse probability weights, derived from the parameters of a survival function with complex forms of data censoring. I apply this solution to the calculation of the proportion of newly built dwellings using web-scraped data on Luxembourg, and I run a counterfactual experiment and a Monte Carlo simulation to confirm the findings. The results show that content removal bias is relatively small when scraping occurs frequently relative to how long the data stays online, and that it grows larger with less frequent scraping.
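The idea of inverse probability weighting against content removal can be illustrated with a minimal Python sketch. It is not the paper's estimator: it assumes exponential online durations with known removal rates (the paper derives them from a survival function with censoring), posting times uniform between two scrapes, and purely hypothetical values for the share of new dwellings, the removal rates, and the scraping interval. All function names are illustrative.

```python
import math
import random


def capture_probability(removal_rate, scrape_interval):
    """Probability that a listing is still online at the next scrape.

    Assumes the posting time is uniform within the scraping interval and the
    online duration is exponential with the given removal rate.
    """
    x = removal_rate * scrape_interval
    return (1.0 - math.exp(-x)) / x


def simulate_observed(n, share_new, rate_new, rate_old, scrape_interval, seed=0):
    """Return the 'is new' flags of the listings a periodic scraper captures."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        is_new = rng.random() < share_new
        rate = rate_new if is_new else rate_old
        wait = rng.uniform(0.0, scrape_interval)  # time until the next scrape
        duration = rng.expovariate(rate)          # time the listing stays online
        if duration >= wait:                      # captured before removal
            observed.append(is_new)
    return observed


def ipw_proportion(flags, weights):
    """Weighted share of flagged units, using inverse probability weights."""
    flagged = sum(w for f, w in zip(flags, weights) if f)
    return flagged / sum(weights)


# Hypothetical parameters: 30% of listings are new dwellings, new dwellings
# stay online 60 days on average, others 20 days, scraping every 30 days.
TRUE_SHARE, RATE_NEW, RATE_OLD, INTERVAL = 0.30, 1 / 60, 1 / 20, 30.0

observed = simulate_observed(200_000, TRUE_SHARE, RATE_NEW, RATE_OLD, INTERVAL)
naive = sum(observed) / len(observed)

# Each captured listing is weighted by the inverse of its capture probability.
weights = [
    1.0 / capture_probability(RATE_NEW if is_new else RATE_OLD, INTERVAL)
    for is_new in observed
]
adjusted = ipw_proportion(observed, weights)

print(f"naive share: {naive:.3f}, weighted share: {adjusted:.3f}")
```

Under these assumptions the unweighted share overstates the longer-lived group, while the weighted share should fall close to the simulated true share. Shortening `INTERVAL` relative to the mean online durations narrows the gap between the two.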
Subject
General Earth and Planetary Sciences, General Environmental Science