Temporal rules discovery for web data cleaning

Published:2015-12 Issue:4 Volume:9 Page:336-347
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Abedjan Ziawasch¹,Akcora Cuneyt G.²,Ouzzani Mourad²,Papotti Paolo²,Stonebraker Michael¹

Affiliation:

1. MIT CSAIL

2. Qatar Computing Research Institute, HBKU

Abstract

Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a "clean" version of the data. To support domain experts in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques have traditionally ignored the time dimension. Recurrent events, such as persons reported in locations, have a duration in which they are valid, and this duration should be part of the rules or the cleaning process would simply fail. In this work, we study the rule discovery problem for temporal web data. Such a discovery process is challenging because of the nature of web data; extracted facts are (i) sparse over time, (ii) reported with delays, and (iii) often reported with errors over the values because of inaccurate sources or non-robust extractors. We handle these challenges with a new discovery approach that is more robust to noise. Our solution uses machine learning methods, such as association measures and outlier detection, for the discovery of the rules, together with an aggressive repair of the data in the mining step itself. Our experimental evaluation over real-world data from Recorded Future, an intelligence company that monitors over 700K Web sources, shows that temporal rules improve the quality of the data with an increase of the average precision in the cleaning process from 0.37 to 0.84, and a 40% relative increase in the average F-measure.
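
To make the idea of a rule with a duration concrete, here is a minimal Python sketch of checking one such rule: an entity should not be reported with two different values within a given time window. This is an illustration only, not the paper's discovery algorithm. The function name `find_violations`, the fact layout `(entity, value, timestamp)`, the sample data, and the 6-hour window are all assumptions made for this example; in the paper the duration is mined from the data rather than supplied by hand.

```python
from datetime import datetime, timedelta
from itertools import combinations


def find_violations(facts, window):
    """Return conflicting fact pairs under a hypothetical temporal rule.

    facts  : iterable of (entity, value, timestamp) tuples (assumed layout)
    window : timedelta; two different values for the same entity that are
             reported less than `window` apart count as a violation
    """
    # Group the reported (timestamp, value) pairs by entity.
    by_entity = {}
    for entity, value, ts in facts:
        by_entity.setdefault(entity, []).append((ts, value))

    violations = []
    for entity, events in by_entity.items():
        events.sort()  # chronological order, so t1 <= t2 in every pair below
        for (t1, v1), (t2, v2) in combinations(events, 2):
            if v1 != v2 and t2 - t1 < window:
                violations.append((entity, (v1, t1), (v2, t2)))
    return violations


# Made-up sample facts: one person reported in three locations.
facts = [
    ("person_a", "Berlin", datetime(2015, 6, 7, 9, 0)),
    ("person_a", "Tokyo", datetime(2015, 6, 7, 10, 0)),
    ("person_a", "Paris", datetime(2015, 6, 9, 12, 0)),
]

# Assumed window of 6 hours; only the Berlin/Tokyo pair falls inside it.
violations = find_violations(facts, timedelta(hours=6))
for entity, first, second in violations:
    print(entity, first, second)
```

Without the window, a plain functional dependency from person to location would flag every pair of reports as conflicting, which is the failure mode the abstract describes for recurrent events.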

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/2856318.2856328

Cited by 30 articles.

1. Humans-in-the-loop: Gamifying activity label repair in process event logs;Engineering Applications of Artificial Intelligence;2024-06

2. Computing Maximal Likelihood Subset Repair for Inconsistent Data;Lecture Notes in Computer Science;2024

3. Computing Minimum Subset Repair on Incomplete Data;Lecture Notes in Computer Science;2024

4. Exploratory Training: When Annotators Learn About Data;Proceedings of the ACM on Management of Data;2023-06-13

5. Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation;2023 IEEE 39th International Conference on Data Engineering (ICDE);2023-04