Abstract
The median absolute deviation (MAD) is a statistic measuring the variability of a set of quantitative elements. It is known to be more robust to outliers than the standard deviation (SD), and thereby widely used in outlier detection. Computing the exact MAD however is costly, e.g., by calling an algorithm of finding median twice, with space cost
O
(
n
) over
n
elements in a set. In this paper, we propose the first fully mergeable approximate MAD algorithm, OP-MAD, with one-pass scan of the data. Remarkably, by calling the proposed algorithm at most twice, namely TP-MAD, it guarantees to return an (ϵ, 1)-accurate MAD, i.e., the error relative to the exact MAD is bounded by the desired ϵ or 1. The space complexity is reduced to
O
(
m
) while the time complexity is
O
(
n
+
m
log
m
), where
m
is the size of the sketch used to compress data, related to the desired error bound ϵ. To get a more accurate MAD, i.e., with smaller ϵ, the sketch size
m
will be larger, a trade-off between effectiveness and efficiency. In practice, we often have the sketch size
m
≪
n
, leading to constant space cost
O
(1) and linear time cost
O
(
n
). The extensive experiments over various datasets demonstrate the superiority of our solution, e.g., 160000× less memory and 18x faster than the aforesaid exact method in datasets
pareto
and
norm
. Finally, we further implement and evaluate the parallelizable TP-MAD in Apache Spark, and the fully mergeable OP-MAD in Structured Streaming.
Subject
General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development
Cited by
8 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献