Approximating median absolute deviation with bounded error

Published: 2021-07 Issue: 11 Volume: 14 Page: 2114-2126
ISSN: 2150-8097
Container-title: Proceedings of the VLDB Endowment
Language: en
Short-container-title: Proc. VLDB Endow.

Author:

Chen Zhiwei¹, Song Shaoxu¹, Wei Ziheng², Fang Jingyun², Long Jiang²

Affiliation:

1. Tsinghua University

2. HUAWEI Cloud BU

Abstract

The median absolute deviation (MAD) is a statistic that measures the variability of a set of quantitative elements. It is known to be more robust to outliers than the standard deviation (SD) and is therefore widely used in outlier detection. Computing the exact MAD is costly, however: for example, calling a median-finding algorithm twice incurs O(n) space over a set of n elements. In this paper, we propose OP-MAD, the first fully mergeable approximate MAD algorithm, which requires a single pass over the data. Remarkably, calling the proposed algorithm at most twice, a variant named TP-MAD, is guaranteed to return an (ϵ, 1)-accurate MAD, i.e., the error relative to the exact MAD is bounded by the desired ϵ or 1. The space complexity is reduced to O(m) and the time complexity is O(n + m log m), where m is the size of the sketch used to compress the data and depends on the desired error bound ϵ. A more accurate MAD, i.e., a smaller ϵ, requires a larger sketch size m, which is a trade-off between effectiveness and efficiency. In practice, we often have m ≪ n, leading to constant space cost O(1) and linear time cost O(n). Extensive experiments over various datasets demonstrate the superiority of our solution, e.g., 160000× less memory and 18× faster than the aforesaid exact method on the datasets pareto and norm. Finally, we implement and evaluate the parallelizable TP-MAD in Apache Spark and the fully mergeable OP-MAD in Structured Streaming.
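
For orientation, the exact baseline mentioned in the abstract (two median computations over all n elements) can be sketched as follows. This is only an illustration of the standard MAD definition, not the paper's OP-MAD or TP-MAD algorithm. The function name exact_mad and the sample values are assumptions introduced here, and Python's statistics.median sorts its input, so this sketch runs in O(n log n) time rather than using a linear-time selection routine.

```python
from statistics import median, pstdev


def exact_mad(values):
    """Exact MAD: the median of absolute deviations from the median.

    Hypothetical helper for illustration; it keeps every element in
    memory, which is the O(n) space cost the abstract refers to.
    """
    data = list(values)
    if not data:
        raise ValueError("MAD is undefined for an empty collection")
    center = median(data)                         # first median call
    deviations = [abs(x - center) for x in data]  # second O(n) buffer
    return median(deviations)                     # second median call


if __name__ == "__main__":
    clean = [1, 2, 3, 4, 5]
    with_outlier = [1, 2, 3, 4, 100]  # one extreme value replaces 5
    # MAD is unchanged by the outlier, while the standard deviation grows.
    print(exact_mad(clean), exact_mad(with_outlier))
    print(pstdev(clean), pstdev(with_outlier))
```

On this made-up sample, the single outlier leaves the MAD unchanged while inflating the standard deviation, which is the robustness property that makes MAD attractive for outlier detection.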

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3476249.3476266

Cited by 8 articles.

1. Determining Exact Quantiles with Randomized Summaries;Proceedings of the ACM on Management of Data;2024-03-12

2. IoTDQ: An Industrial IoT Data Analysis Library for Apache IoTDB;Big Data Mining and Analytics;2024-03

3. Detecting potential outliers in longitudinal data with time-dependent covariates;European Journal of Clinical Nutrition;2024-01-03

4. Influence of the physical effort of reminder-setting on strategic offloading of delayed intentions;Quarterly Journal of Experimental Psychology;2023-09-23

5. CORE-Sketch: On Exact Computation of Median Absolute Deviation with Limited Space;Proceedings of the VLDB Endowment;2023-07