Affiliation:
1. Harbin Institute of Technology
2. University of California, Irvine
3. Tsinghua University
Abstract
For big data, data quality problem is more serious. Big data cleaning system requires scalability and the abilityof handling mixed errors. Motivated by this, we develop Cleanix, a prototype system for cleaning relational Big Data. Cleanix takes data integrated from multiple data sources and cleans them on a shared-nothing machine cluster. The backend system is built on-top-of an extensible and flexible data-parallel substrate the Hyracks framework. Cleanix supports various data cleaning tasks such as abnormal value detection and correction, incomplete data filling, de-duplication, and conflict resolution. In this paper, we show the organization, data cleaning algorithms as well as the design of Cleanix.
Publisher
Association for Computing Machinery (ACM)
Subject
Information Systems,Software
Cited by
14 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献