Towards Automatic Data Format Transformations: Data Wrangling at Scale

Published:2018-12-01 Issue:7 Volume:62 Page:1044-1060
ISSN:0010-4620
Container-title:The Computer Journal
language:en

Author:

Bogatu Alex¹, Paton Norman W¹, Fernandes Alvaro A A¹, Koehler Martin¹

Affiliation:

1. School of Computer Science, University of Manchester, Manchester, UK

Abstract

Data wrangling is the process whereby data are cleaned and integrated for analysis. Data wrangling, even with tool support, is typically a labour-intensive process. One aspect of data wrangling involves carrying out format transformations on attribute values, for example so that names or phone numbers are represented consistently. Recent research has developed techniques for synthesizing format transformation programs from examples of the source and target representations. This is valuable, but still requires a user to provide suitable examples, something that may be challenging in applications in which there are huge datasets or numerous data sources. In this paper, we investigate the automatic discovery of examples that can be used to synthesize format transformation programs. In particular, we propose two approaches to identifying candidate data examples and validating the transformations that are synthesized from them. The approaches are evaluated empirically using datasets from open government data.

Funder

Engineering and Physical Sciences Research Council

Publisher

Oxford University Press (OUP)

Subject

General Computer Science

Link

http://academic.oup.com/comjnl/article-pdf/62/7/1044/28952449/bxy118.pdf

References: 22 articles.

1. Blinkfill: Semi-supervised programming by example for syntactic string transformations;Singh;Proc. VLDB Endowment,2016

Cited by 5 articles.

1. A large reproducible benchmark on text classification for the legal domain based on the ECHR-OD repository;Information Systems;2023-10

2. Data Preparation: A Technological Perspective and Review;SN Computer Science;2023-06-02

3. Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V;Proceedings of the VLDB Endowment;2023-02

4. Advances on Data Management and Information Systems;Information Systems Frontiers;2022-02

5. VADA: an architecture for end user informed data preparation;Journal of Big Data;2019-08-21