Affiliation:
1. MIT
2. AT&T Labs--Research
3. University of Michigan
Abstract
The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worst-case guarantees on real data. This leads to the question of whether some stronger bounds can be guaranteed. We answer this in the positive by showing that a class of counter-based algorithms (including the popular and very space-efficient Frequent and SpaceSaving algorithms) provides much stronger approximation guarantees than previously known. Specifically, we show that errors in the approximation of individual elements do not depend on the frequencies of the most frequent elements, but only on the frequency of the remaining tail. This shows that counter-based methods are the most space-efficient (in fact, space-optimal) algorithms having this strong error bound.
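To make the counter-based idea concrete, here is a minimal Python sketch of a SpaceSaving-style summary. It is an illustration under stated assumptions, not the authors' implementation: the function name `space_saving`, the parameter `m` (number of counters), and the toy stream are all hypothetical, and the tail bound in the comments (residual mass divided by `m - k`) is an assumed form of the guarantee that the abstract only describes in words.

```python
from collections import Counter


def space_saving(stream, m):
    """SpaceSaving-style summary that keeps at most m (item, count) pairs."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1              # monitored item: increment its counter
        elif len(counters) < m:
            counters[x] = 1               # free slot: start monitoring x
        else:
            # Evict an item with the smallest counter; x inherits that count + 1.
            victim = min(counters, key=counters.get)
            counters[x] = counters.pop(victim) + 1
    return counters


# Toy stream (hypothetical): two heavy items and a light tail of four items.
stream = ["a"] * 50 + ["b"] * 30 + ["c", "d", "e", "f"] * 5
truth = Counter(stream)
est = space_saving(stream, m=4)

# Largest per-item error, counting unmonitored items as estimate 0.
max_err = max(abs(truth[x] - est.get(x, 0)) for x in truth)

k = 2
residual = sum(sorted(truth.values(), reverse=True)[k:])  # mass outside the top k
worst_case_bound = len(stream) / 4                        # classic N / m bound
tail_bound = residual / (4 - k)                           # assumed tail-bound form
```

On this toy input the two heavy items are counted exactly and the largest error stays below the assumed tail bound, which is itself much smaller than the classic N / m bound. The linear scan in `min` keeps the sketch short; a practical implementation would use a structure that finds the smallest counter in constant time.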
This tail guarantee allows these algorithms to solve the sparse recovery problem. Here, the goal is to recover a faithful representation of the vector of frequencies, $f$. We prove that using space $O(k)$, the algorithms construct an approximation $f^*$ to the frequency vector $f$ so that the $L_1$ error $\|f - f^*\|_1$ is close to the best possible error $\min_{f'} \|f' - f\|_1$, where $f'$ ranges over all vectors with at most $k$ non-zero entries. This improves the previously best known space bound of about $O(k \log n)$ for streams without element deletions (where $n$ is the size of the domain from which stream elements are drawn). Other consequences of the tail guarantees are results for skewed (Zipfian) data, and guarantees for accuracy of merging multiple summarized streams.
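The sparse recovery claim can be illustrated with a small Python sketch that builds a $k$-sparse approximation $f^*$ from a counter summary and compares its $L_1$ error with the best possible $k$-sparse error. Everything here is an assumption made for illustration: the helper names (`space_saving`, `k_sparse_from_summary`, `l1_error`, `best_k_sparse_error`), the choice of keeping the `k` largest counters as $f^*$, and the toy stream are hypothetical and not taken from the paper.

```python
from collections import Counter


def space_saving(stream, m):
    """SpaceSaving-style summary with at most m counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < m:
            counters[x] = 1
        else:
            victim = min(counters, key=counters.get)   # smallest counter
            counters[x] = counters.pop(victim) + 1
    return counters


def k_sparse_from_summary(counters, k):
    """Keep the k largest counters as the k-sparse approximation f*."""
    top = sorted(counters.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)


def l1_error(f, g):
    """L1 distance between two sparse frequency vectors stored as dicts."""
    return sum(abs(f.get(x, 0) - g.get(x, 0)) for x in set(f) | set(g))


def best_k_sparse_error(f, k):
    """min over k-sparse f' of ||f' - f||_1: the mass outside the top k entries."""
    return sum(sorted(f.values(), reverse=True)[k:])


# Toy stream (hypothetical): two heavy items plus a light tail.
stream = ["a"] * 50 + ["b"] * 30 + ["c", "d", "e", "f"] * 5
truth = Counter(stream)

summary = space_saving(stream, m=4)
f_star = k_sparse_from_summary(summary, k=2)

recovered_err = l1_error(truth, f_star)
optimal_err = best_k_sparse_error(truth, k=2)
```

On this input the recovered error matches the optimal 2-sparse error because both heavy items are counted exactly. In general the abstract only promises an error close to the optimum, so `recovered_err` may exceed `optimal_err` on other streams.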
Publisher
Association for Computing Machinery (ACM)
References: 41 articles.
1. Space-optimal heavy hitters with strong error bounds
2. Bestavros A. Crovella M. and Taqqu T. 1999. Heavy-Tailed Probability Distributions in the World Wide Web. Birkhäuser 3--25.
Cited by
30 articles.
1. DISCO: A Dynamically Configurable Sketch Framework in Skewed Data Streams;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13
2. Randomized counter-based algorithms for frequency estimation over data streams in O(loglogN) space;Theoretical Computer Science;2024-02
3. Compact Frequency Estimators in Adversarial Environments;Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security;2023-11-15
4. SpaceSaving±;Proceedings of the VLDB Endowment;2022-02
5. Timely Reporting of Heavy Hitters Using External Memory;ACM Transactions on Database Systems;2021-12-31