Leakage in data mining

Published:2012-12 Issue:4 Volume:6 Page:1-21
ISSN:1556-4681
Container-title:ACM Transactions on Knowledge Discovery from Data
language:en
Short-container-title:ACM Trans. Knowl. Discov. Data

Author:

Kaufman Shachar¹, Rosset Saharon¹, Perlich Claudia², Stitelman Ori²

Affiliation:

1. Tel Aviv University, Israel

2. m6d

Abstract

Deemed “one of the top ten data mining mistakes”, leakage is the introduction of information about the data mining target that should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical independently and identically distributed (i.i.d.) assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple specific approach to data management followed by what we call a learn-predict separation, and present several ways of detecting leakage when the modeler has no control over how the data have been collected. We also offer an alternative point of view on leakage that is based on causal graph modeling concepts.
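One common way leakage enters a pipeline is through preprocessing parameters computed over all rows, including the ones later used for evaluation. The sketch below illustrates keeping the learning step apart from the prediction step in that narrow sense. It is an illustration under stated assumptions, not the authors' methodology: the toy data and the helper names `fit_scaler` and `apply_scaler` are hypothetical, and the abstract does not specify how the learn-predict separation is implemented.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn centering and scaling parameters from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 if there is no spread
    return mu, sigma


def apply_scaler(values, params):
    """Standardize values with previously learned parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Hypothetical toy feature ordered by time; the last two rows stand in
# for records that will only be seen at prediction time.
feature = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, holdout = feature[:4], feature[4:]

# Leaky: the parameters are learned from every row, held-out rows included.
leaky_params = fit_scaler(feature)

# Separated: the parameters are learned from the training rows only,
# then reused unchanged on the held-out rows.
clean_params = fit_scaler(train)
train_scaled = apply_scaler(train, clean_params)
holdout_scaled = apply_scaler(holdout, clean_params)
```

In the leaky variant the learned mean is pulled toward the held-out values, so information about rows that should be unseen influences the training representation. The separated variant never reads the held-out rows while learning.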

Funder

Israel Science Foundation

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/2382577.2382579

References: 23 articles.

1. Co-Integration and Error Correction: Representation, Estimation, and Testing

2. Hastie T., Tibshirani R., and Friedman J. H. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

3. Kohavi R., Brodley C., Frasca B., Mason L., and Zheng Z. 2000. KDD-Cup 2000 organizers' report: Peeling the onion. ACM SIGKDD Explor. Newslett. 2. 10.1145/380995.381033

4. Lessons and Challenges from Mining Retail E-Commerce Data

Cited by 307 articles.

1. Prediction of long-term creep modulus of thermoplastics using brief tests and interpretable machine learning;International Journal of Solids and Structures;2024-11

2. Machine learning for predicting protein properties: A comprehensive review;Neurocomputing;2024-09

3. Predictive analysis on the factors associated with birth Outcomes: A machine learning perspective;International Journal of Medical Informatics;2024-09

4. Obtaining the Most Accurate, Explainable Model for Predicting Chronic Obstructive Pulmonary Disease: Triangulation of Multiple Linear Regression and Machine Learning Methods;JMIR AI;2024-08-29

5. Graph Machine Learning Meets Multi-Table Relational Data;Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24