Abstract
AbstractData analysis often uses data sets that were collected for different purposes. Indeed, new insights are often obtained by combining data sets that were produced independently of each other, for example by combining data from outside an organization with internal data resources. As a result, there is a need to discover, clean, integrate and restructure data into a form that is suitable for an intended analysis. Data preparation, also known as data wrangling, is the process by which data are transformed from its existing representation into a form that is suitable for analysis. In this paper, we review the state-of-the-art in data preparation, by: (i) describing functionalities that are central to data preparation pipelines, specifically profiling, matching, mapping, format transformation and data repair; and (ii) presenting how these capabilities surface in different approaches to data preparation, that involve programming, writing workflows, interacting with individual data sets as tables, and automating aspects of the process. These functionalities and approaches are illustrated with reference to a running example that combines open government data with web extracted real estate data.
Funder
Engineering and Physical Sciences Research Council
Horizon 2020 Framework Programme
Publisher
Springer Science and Business Media LLC
Subject
Computer Science Applications,Computer Networks and Communications,Computer Graphics and Computer-Aided Design,Computational Theory and Mathematics,Artificial Intelligence,General Computer Science
Cited by
1 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献