Author:
Barcaroli Guilio,Nurra Alessandra,Salamone Sergio,Scannapieco Monica,Scarnò Marco,Summa Donato
Abstract
The Istat sampling survey on ICT in enterprises aims at producing information onthe use of ICT and in particular on the use of Internet by Italian enterprises for various purposes (e-commerce, e-recruitment, advertisement, e-tendering, e-procurement, egovernment). To such a scope, data are collected by means of the traditional instrument of the questionnaire. Istat began to explore the possibility to use web scraping techniques, associated, in the estimation phase, to text and data mining algorithms, with the aim to replace traditional instruments of data collection and estimation, or to combine them in an integrated approach. The 8,600 websites, indicated by the 19,000 enterprises responding to ICT survey of year 2013, have been scraped and the acquired texts have been processed in order to try to reproduce the same information collected via questionnaire. Preliminary results are encouraging, showing in some cases a satisfactory predictive capability of fittedmodels (mainly those obtained by using the Naive Bayes algorithm). Also the method known as Content Analysis has been applied, and its results compared to those obtained with classical learners. In order to improve the overall performance, an advanced system for scraping and mining is being adopted, based on the open source Apache suite Nutch-Solr-Lucene. On the basis of the nal results of this test, an integrated system harnessing both survey data and data collected from Internet to produce the required estimates will be implemented, based on systematic scraping of the near 100,000 websites related to the whole population of Italian enterprises with 10 persons employed and more, operating in industry and services. This new approach, based on Internet as Data source (IaD), is characterized by advantages and drawbacks that need to be carefully analysed.
Publisher
Austrian Statistical Society
Subject
Applied Mathematics,Statistics, Probability and Uncertainty,Statistics and Probability
Cited by
10 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献