SimClone: Detecting Tabular Data Clones using Value Similarity

Published: 2024-07-16
ISSN: 1049-331X
Container-title: ACM Transactions on Software Engineering and Methodology
Language: en
Short-container-title: ACM Trans. Softw. Eng. Methodol.

Authors:

Yang Xu¹, Rajbahadur Gopi Krishnan², Lin Dayi², Wang Shaowei¹, Jiang Zhen Ming (Jack)³

Affiliation:

1. University of Manitoba, Canada

2. Centre for Software Excellence, Huawei Canada, Canada

3. York University, Canada

Abstract

Data clones are multiple copies of the same data across datasets. The presence of data clones between datasets can cause issues such as difficulty managing data assets and data license violations when datasets containing clones are used to build AI software. However, detecting data clones is not trivial. The majority of prior studies in this area rely on structural information (e.g., font size, column headers) to detect data clones, but tabular datasets used to build AI software are typically stored without any structural information. In this paper, we propose a novel method called SimClone for data clone detection in tabular datasets without relying on structural information. SimClone uses value similarities for data clone detection. We also propose a visualization approach as part of SimClone to help locate the exact position of the cloned data between a dataset pair. Our results show that SimClone outperforms the current state-of-the-art method by at least 20% in terms of both F1-score and AUC. In addition, SimClone’s visualization component helps identify the exact location of the data clone in a dataset with a Precision@10 value of 0.80 in the top 20 true positive predictions.
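To make the idea of comparing tables by their values rather than their structure concrete, the following is a minimal illustrative sketch. It is not the SimClone algorithm, whose details the abstract does not give. As a stand-in, it scores every column pair of two header-less tables by the Jaccard overlap of their cell values. The function names `jaccard` and `column_similarities` and the toy tables are hypothetical.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of the distinct values in two collections."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def column_similarities(table_a, table_b):
    """Score every column pair of two header-less tables (lists of rows).

    Assumption: Jaccard overlap is an illustrative stand-in for
    "value similarity"; the paper's actual measure may differ.
    """
    cols_a = list(zip(*table_a))  # transpose rows into columns
    cols_b = list(zip(*table_b))
    scores = {}
    for (i, ca), (j, cb) in product(enumerate(cols_a), enumerate(cols_b)):
        scores[(i, j)] = jaccard(ca, cb)
    return scores


# Hypothetical toy data: the second column of table_b repeats the
# first column of table_a, in a different row order and position.
table_a = [("alice", 30), ("bob", 41), ("carol", 25)]
table_b = [(7.5, "bob"), (8.1, "alice"), (6.9, "carol")]

scores = column_similarities(table_a, table_b)
best_pair = max(scores, key=scores.get)  # most similar column pair
print(best_pair, scores[best_pair])
```

Because the comparison uses only cell values, the matching columns are found even though they sit at different positions and carry no headers.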

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3676961
