Timely Reporting of Heavy Hitters Using External Memory
-
Published:2021-12-31
Issue:4
Volume:46
Page:1-35
-
ISSN:0362-5915
-
Container-title:ACM Transactions on Database Systems
-
language:en
-
Short-container-title:ACM Trans. Database Syst.
Author:
Singh Shikha (1),
Pandey Prashant (2),
Bender Michael A. (3),
Berry Jonathan W. (4),
Farach-Colton Martín (5),
Johnson Rob (6),
Kroeger Thomas M. (4),
Phillips Cynthia A. (4)
Affiliation:
1. Williams College, Williamstown, MA
2. Lawrence Berkeley National Laboratories and University of California Berkeley
3. Stony Brook University, NY, USA
4. Sandia National Laboratories, Albuquerque, NM
5. Rutgers University, Piscataway, NJ
6. VMware Research, Palo Alto, CA
Abstract
Given an input stream S of size N, a ɸ-heavy hitter is an item that occurs at least ɸN times in S. The problem of finding heavy-hitters is extensively studied in the database literature.
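The definition above can be made concrete with a minimal sketch. It assumes the whole stream fits in memory and keeps an exact count for every distinct item, which is precisely what becomes impractical at scale. The function name `heavy_hitters` is hypothetical and chosen only for illustration.

```python
from collections import Counter


def heavy_hitters(stream, phi):
    """Return the set of items that occur at least phi * N times in the stream."""
    items = list(stream)          # materialize the stream so N is known
    threshold = phi * len(items)  # the phi * N occurrence threshold
    counts = Counter(items)       # exact count per distinct item
    return {item for item, count in counts.items() if count >= threshold}


if __name__ == "__main__":
    # N = 8 and phi = 0.5, so an item needs at least 4 occurrences.
    print(heavy_hitters("aababcad", 0.5))
```

This exact approach has no false positives or false negatives, but its memory use grows with the number of distinct items.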
We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T-th occurrence, where T = ɸN (the point at which it becomes a heavy hitter). We call this the Timely Event Detection (TED) problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity).
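As an illustration of the reporting requirement only, the following sketch reports each item at the exact moment of its T-th occurrence, i.e., with zero delay. It is an in-memory baseline under the assumption that one counter per distinct item fits in RAM, not the external-memory structure the paper describes. The name `timely_reports` and the 1-based stream positions are assumptions made for this example.

```python
from collections import defaultdict


def timely_reports(stream, threshold):
    """Yield (position, item) when an item reaches its threshold-th occurrence.

    Positions are 1-based indices into the stream. Each item is reported
    exactly once, at the occurrence that makes it a heavy hitter.
    """
    counts = defaultdict(int)
    for position, item in enumerate(stream, start=1):
        counts[item] += 1
        if counts[item] == threshold:
            yield position, item


if __name__ == "__main__":
    for position, item in timely_reports("abacabb", threshold=3):
        print(f"item {item!r} reported at position {position}")
```

Because the check uses equality rather than "at least", later occurrences of an already-reported item do not trigger duplicate reports.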
Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus, in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes).
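To illustrate the kind of tradeoff described above, here is a sketch of the Misra-Gries summary, a standard small-space heavy-hitters algorithm. It is used here purely as a well-known example and is not presented as the paper's own construction. With at most k - 1 counters, every item occurring more than N/k times is guaranteed to remain in the summary, but items below that frequency may remain as well, so the output is a candidate set that can contain false positives.

```python
def misra_gries(stream, k):
    """Return a Misra-Gries summary of the stream using at most k - 1 counters.

    Any item with more than N/k occurrences is guaranteed to be a key of the
    returned dict; the stored values are lower bounds on the true counts.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all counters, dropping any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    # N = 8, k = 3: 'a' occurs 4 times, which exceeds N/k, so it must survive.
    print(misra_gries("abcaaabd", k=3))
```

Confirming which candidates are true heavy hitters generally takes a second pass over the stream, which is one way such algorithms give up timeliness.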
We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead.
We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device’s random I/O throughput, i.e., ≈100K observations per second.
Funder
NSF
Laboratory-Directed Research-and-Development program at Sandia National Laboratories
National Technology and Engineering Solutions of Sandia, LLC.
Honeywell International, Inc.
U.S. Department of Energy’s National Nuclear Security Administration
U.S. Department of Energy or the United States Government
Advanced Scientific Computing Research
Office of Science of the DOE
NERSC
Exascale Computing Project
U.S. Department of Energy Office of Science and the National Nuclear Security Administration
Publisher
Association for Computing Machinery (ACM)
Subject
Information Systems