TemporalDedup: Domain-Independent Deduplication of Redundant and Errant Temporal Data

Published:2023-04-18 Issue:02 Volume:17 Page:309-343
ISSN:1793-351X
Container-title:International Journal of Semantic Computing
language:en
Short-container-title:Int. J. Semantic Computing

Author:

Rogers Jon¹,Aygun Ramazan²,Etzkorn Letha¹

Affiliation:

1. Department of Computer Science, University of Alabama in Huntsville, 301 Sparkman Drive, Huntsville, Alabama 35899, USA

2. Department of Computer Science, Kennesaw State University, 1100 South Marietta Parkway SE, Marietta, Georgia 30060, USA

Abstract

Deduplication is a key component of the data preparation process, a bottleneck in the machine learning (ML) and data mining pipeline that is very time-consuming and often relies on domain expertise and manual involvement. Further, temporal data is increasingly prevalent and is not well suited to traditional similarity and distance-based deduplication techniques. We establish a fully automated, domain-independent deduplication model for temporal data domains, known as TemporalDedup, that infers the key attribute(s), applies a base set of deduplication techniques focused on value matches for key, non-key, and elapsed time, and further detects duplicates through inference of temporal ordering requirements using Longest Common Subsequence (LCS) for records of a shared type. Using LCS, we split each record’s temporal sequence into constrained and unconstrained sequences. We flag suspicious (errant) records that are non-adherent to the inferred constrained order and we flag a record as a duplicate if its unconstrained order, of sufficient length, matches that of another record. TemporalDedup was compared against a similarity-based Adaptive Sorted Neighborhood Method (ASNM) in evaluating duplicates for two disparate datasets: (1) 22,794 records from Sony’s PlayStation Network (PSN) trophy data, where duplication may be indicative of cheating, and (2) emergency declarations and government responses related to COVID-19 for all U.S. states and territories. TemporalDedup (F1-scores of 0.971 and 0.954) exhibited combined sensitivities above 0.9 for all duplicate classes whereas ASNM (0.705 and 0.732) exhibited combined sensitivities below 0.2 for all time and order duplicate classes.
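The LCS-based step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the event names, the `min_len` threshold, the helper names, and the use of a small set of trusted reference records to infer the constrained order are all assumptions made for illustration, since the abstract does not specify these details.

```python
from functools import reduce


def lcs(a, b):
    """Longest common subsequence of two sequences (dynamic programming)."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table forward to recover one LCS.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def split_sequence(seq, constrained_order):
    """Split a record's events into constrained and unconstrained parts."""
    constrained_set = set(constrained_order)
    constrained = [e for e in seq if e in constrained_set]
    unconstrained = [e for e in seq if e not in constrained_set]
    return constrained, unconstrained


def flag_records(records, constrained_order, min_len=2):
    """Return (errant_ids, duplicate_ids) for a dict of id -> event sequence.

    min_len is a hypothetical threshold for "sufficient length".
    """
    errant, duplicates = set(), set()
    seen = {}  # unconstrained order -> first record id that showed it
    for rec_id, seq in records.items():
        constrained, unconstrained = split_sequence(seq, constrained_order)
        present = set(constrained)
        expected = [e for e in constrained_order if e in present]
        if constrained != expected:
            errant.add(rec_id)  # violates the inferred ordering
        if len(unconstrained) >= min_len:
            key = tuple(unconstrained)
            if key in seen:
                duplicates.update({rec_id, seen[key]})
            else:
                seen[key] = rec_id
    return errant, duplicates


# Hypothetical records of one shared type (event names are invented).
records = {
    "r1": ["boot", "t1", "tutorial", "t2", "boss", "credits"],
    "r2": ["boot", "t3", "tutorial", "boss", "t4", "credits"],
    "r3": ["boot", "tutorial", "t1", "t2", "boss", "credits"],
    "r4": ["boot", "boss", "tutorial", "t5", "credits"],
}

# Assumption: r1 and r2 are trusted, so their LCS gives the constrained order.
constrained_order = reduce(lcs, [records["r1"], records["r2"]])
errant, duplicates = flag_records(records, constrained_order)
print(constrained_order, errant, duplicates)
```

In this toy data the inferred constrained order is `boot, tutorial, boss, credits`. Record `r4` is flagged as errant because it places `boss` before `tutorial`, and `r1` and `r3` are flagged as duplicates because their unconstrained events occur in the same order (`t1`, `t2`).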

Publisher

World Scientific Pub Co Pte Ltd

Subject

Artificial Intelligence,Computer Networks and Communications,Computer Science Applications,Linguistics and Language,Information Systems,Software

Link

https://www.worldscientific.com/doi/pdf/10.1142/S1793351X23500010

References: 37 articles.

1. Communications in Computer and Information Science;Wangikar V.,2021

2. R-Dedup: Secure client-side deduplication for encrypted data without involving a third-party entity

3. Secure Encrypted Data Deduplication Based on Data Popularity

4. SecDedoop: Secure Deduplication with Access Control of Big Data in the HDFS/Hadoop Environment

Cited by 1 article.

1. Temporal information retrieval using bitwise operators;Information Retrieval Journal;2023-09-23