Timely Reporting of Heavy Hitters Using External Memory

Author:

Singh Shikha (1), Pandey Prashant (2), Bender Michael A. (3), Berry Jonathan W. (4), Farach-Colton Martín (5), Johnson Rob (6), Kroeger Thomas M. (4), Phillips Cynthia A. (4)

Affiliation:

1. Williams College, Williamstown, MA

2. Lawrence Berkeley National Laboratories and University of California Berkeley

3. Stony Brook University, NY, USA

4. Sandia National Laboratories, Albuquerque, NM

5. Rutgers University, Piscataway, NJ

6. VMware Research, Palo Alto, CA

Abstract

Given an input stream S of size N, a ɸ-heavy hitter is an item that occurs at least ɸN times in S. The problem of finding heavy hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = ɸN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device’s random I/O throughput, i.e., ≈100K observations per second.
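
The reporting rule in the abstract can be illustrated with a minimal in-RAM sketch. This is not the external-memory data structure the paper describes; it is only an exact-counting baseline, under the assumption that all counts fit in memory, showing what "report an item at its T-th occurrence" means. The function name `timely_heavy_hitters` and the parameter `threshold` (standing in for T = ɸN) are hypothetical names chosen for this illustration.

```python
from collections import Counter
from typing import Hashable, Iterable, Iterator, Tuple


def timely_heavy_hitters(
    stream: Iterable[Hashable], threshold: int
) -> Iterator[Tuple[int, Hashable]]:
    """Yield (position, item) as soon as an item reaches `threshold` occurrences.

    Hypothetical helper for illustration: exact counts are kept in RAM,
    one counter per distinct item, so there are no false positives or
    false negatives and the reporting delay is zero.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    counts = Counter()
    for position, item in enumerate(stream, start=1):
        counts[item] += 1
        # Report exactly once, at the occurrence that crosses the threshold.
        if counts[item] == threshold:
            yield position, item


stream = ["a", "b", "a", "c", "a", "b", "b", "a"]
reports = list(timely_heavy_hitters(stream, threshold=3))
print(reports)  # [(5, 'a'), (7, 'b')]
```

This baseline keeps one counter per distinct item, which is the large-space regime the abstract says is unavoidable for exact reporting. Once those counters no longer fit in RAM, each update would become a random I/O, which is the bottleneck the abstract attributes to a naive external-memory adaptation.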

Funder

NSF

Laboratory-Directed Research-and-Development program at Sandia National Laboratories

National Technology and Engineering Solutions of Sandia, LLC.

Honeywell International, Inc.

U.S. Department of Energy’s National Nuclear Security Administration

U.S. Department of Energy or the United States Government

Advanced Scientific Computing Research

Office of Science of the DOE

NERSC

Exascale Computing Project

U.S. Department of Energy Office of Science and the National Nuclear Security Administration

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems

