Affiliation:
1. VISTA Lab, Algoritmi Center, University of Évora, Portugal
2. LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal
3. INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Portugal
Abstract
The existence of large amounts of data increases the probability that data quality problems occur. A data cleaning process that corrects these problems is usually iterative, because it may need to be re-executed and refined to produce high-quality data. Moreover, because some data quality problems are highly specific and data cleaning programs cannot cover all of them, a user often has to be involved during program execution to repair data manually. However, no data cleaning framework appropriately supports this involvement, a form of human-in-the-loop, in such an iterative process for cleaning structured data. Furthermore, data preparation tools that involve the user in data cleaning processes to some degree have not been evaluated with real users to assess the effort they require.
Therefore, we propose Cleenex, a data cleaning framework that supports user involvement during an iterative data cleaning process, and conduct two experimental evaluations of data cleaning: an assessment, with a simulated user, of the Cleenex components that support the user when manually repairing data; and a comparison, with real users, of data preparation tools in terms of user involvement.
Results show that the Cleenex components reduce user effort when manually cleaning data during a data cleaning process; for example, the number of tuples visualized is reduced by 99%. Moreover, when performing data cleaning tasks with Cleenex, real users need less time and effort (e.g., half the clicks) and, based on questionnaires, prefer it to the other tools used for comparison, OpenRefine and Pentaho Data Integration.
Funder
Fundação para a Ciência e a Tecnologia
Publisher
Association for Computing Machinery (ACM)
Cited by
1 article.
1. Mitigating Data Sparsity in Integrated Data through Text Conceptualization. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), 2024-05-13.