Benchmarking Unsupervised Outlier Detection with Realistic Synthetic Data

Published:2021-08-31 Issue:4 Volume:15 Page:1-20
ISSN:1556-4681
Container-title:ACM Transactions on Knowledge Discovery from Data
language:en
Short-container-title:ACM Trans. Knowl. Discov. Data

Author:

Steinbuss Georg¹, Böhm Klemens¹

Affiliation:

1. Karlsruhe Institute of Technology (KIT), Germany

Abstract

Benchmarking unsupervised outlier detection is difficult. Outliers are rare, and existing benchmark data contains outliers with various and unknown characteristics. Fully synthetic data usually consists of outliers and regular instances with clear characteristics and thus allows for a more meaningful evaluation of detection methods in principle. Nonetheless, there have only been a few attempts to include synthetic data in benchmarks for outlier detection. This might be due to the imprecise notion of outliers or to the difficulty of arriving at a good coverage of different domains with synthetic data. In this work, we propose a generic process for the generation of datasets for such benchmarking. The core idea is to reconstruct regular instances from existing real-world benchmark data while generating outliers so that they exhibit insightful characteristics. We propose and describe a generic process for the benchmarking of unsupervised outlier detection, as sketched so far. We then describe three instantiations of this generic process that generate outliers with specific characteristics, like local outliers. To validate our process, we perform a benchmark with state-of-the-art detection methods and carry out experiments to study the quality of data reconstructed in this way. Next to showcasing the workflow, this confirms the usefulness of our proposed process. In particular, our process yields regular instances close to the ones from real data. Summing up, we propose and validate a new and practical process for the benchmarking of unsupervised outlier detection.
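The core idea in the abstract (fit a model to real regular instances, sample synthetic regular instances from it, and generate outliers with a known characteristic) can be illustrated with a minimal sketch. This is not the authors' procedure: the single multivariate Gaussian, the Mahalanobis-distance threshold, the padded bounding box, and every function and parameter name below are assumptions chosen only to keep the example short.

```python
import numpy as np


def fit_gaussian(real_inliers):
    """Assumed stand-in model: one multivariate Gaussian over the real regular data."""
    mean = real_inliers.mean(axis=0)
    cov = np.cov(real_inliers, rowvar=False)
    return mean, cov


def mahalanobis_sq(points, mean, cov):
    """Squared Mahalanobis distance of each row of `points` to the fitted Gaussian."""
    diff = points - mean
    inv = np.linalg.inv(cov)
    return np.einsum("ij,jk,ik->i", diff, inv, diff)


def make_benchmark(real_inliers, n_regular, n_outliers, rng, quantile=0.99):
    """Return (X, y, threshold); y marks which generator produced each row."""
    mean, cov = fit_gaussian(real_inliers)

    # Synthetic regular instances: samples from the model fitted to real data.
    regular = rng.multivariate_normal(mean, cov, size=n_regular)

    # Outliers: uniform candidates in a padded bounding box, kept only if they
    # lie farther from the model than `quantile` of the real regular instances.
    threshold = np.quantile(mahalanobis_sq(real_inliers, mean, cov), quantile)
    lo, hi = real_inliers.min(axis=0), real_inliers.max(axis=0)
    span = hi - lo
    kept = []
    while len(kept) < n_outliers:
        cand = rng.uniform(lo - 0.5 * span, hi + 0.5 * span,
                           size=(4 * n_outliers, real_inliers.shape[1]))
        kept.extend(cand[mahalanobis_sq(cand, mean, cov) > threshold])
    outliers = np.array(kept[:n_outliers])

    X = np.vstack([regular, outliers])
    y = np.concatenate([np.zeros(n_regular, dtype=int),
                        np.ones(n_outliers, dtype=int)])
    return X, y, threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 2))  # placeholder for real benchmark inliers
    X, y, _ = make_benchmark(real, n_regular=200, n_outliers=20, rng=rng)
    print(X.shape, int(y.sum()))
```

Because the outliers here are defined relative to the fitted model, their defining characteristic (distance from the regular distribution) is known by construction, which is the property the abstract attributes to synthetic data. A richer model of the regular instances could replace the Gaussian without changing the overall shape of the sketch.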

Funder

Deutsche Forschungsgemeinschaft

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3441453

References: 50 articles.

1. Pair-copula constructions of multiple dependence

2. Synthetic Generation of High-Dimensional Datasets

3. A weighted k-nearest neighbor density estimate for geometric inference

Cited by 14 articles.

1. Outlier Detection in Auditing: Integrating Unsupervised Learning within a Multilevel Framework for General Ledger Analysis;Journal of Information Systems;2024-06-14

2. Understanding the limitations of self-supervised learning for tabular anomaly detection;Pattern Analysis and Applications;2024-03-12

3. Synthetic Data Generation;Advances in Business Information Systems and Analytics;2024-01-16

4. Using Autonomous Outlier Detection Methods for Thermophysical Property Data;Journal of Chemical & Engineering Data;2024-01-12

5. A General Framework for the Assessment of Detectors of Anomalies in Time Series;IEEE Transactions on Industrial Informatics;2024