What's hot and what's not: tracking most frequent items dynamically

Author:

Graham Cormode (1), S. Muthukrishnan (2)

Affiliation:

1. Rutgers University, Murray Hill, NJ

2. Rutgers University, Piscataway, NJ

Abstract

Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items” in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in many applications.

We present new methods for dynamically determining the hot items at any time in a relation which is undergoing deletion operations as well as inserts. Our methods maintain small-space data structures that monitor the transactions on the relation, and, when required, quickly output all hot items without rescanning the relation in the database. With user-specified probability, all hot items are correctly reported. Our methods rely on ideas from “group testing.” They are simple to implement, and have provable quality, space, and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithms are accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
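To make the group-testing idea concrete, here is a minimal Python sketch, not the authors' exact algorithm. It assumes items are non-negative integers below 2**bits. Each item is hashed into one group per repetition, and each group keeps a total count plus one counter per bit of the item. An item that dominates its group can then be decoded bit by bit from the counters. The class name `HotItemSketch` and the parameters `groups`, `reps`, `bits`, and `seed` are illustrative choices that do not come from the paper.

```python
import random


class HotItemSketch:
    """Simplified group-testing sketch; items are non-negative ints < 2**bits."""

    def __init__(self, groups=64, reps=3, bits=16, seed=0):
        rng = random.Random(seed)
        self.groups, self.bits = groups, bits
        self.prime = 2_147_483_647  # Mersenne prime 2**31 - 1
        # One (a, b) pair per repetition: h(x) = ((a*x + b) mod prime) mod groups.
        self.hashes = [(rng.randrange(1, self.prime), rng.randrange(self.prime))
                       for _ in range(reps)]
        # row[0] is the group total; row[j + 1] counts items with bit j set.
        self.counts = [[[0] * (bits + 1) for _ in range(groups)]
                       for _ in range(reps)]
        self.total = 0

    def _group(self, rep, item):
        a, b = self.hashes[rep]
        return ((a * item + b) % self.prime) % self.groups

    def update(self, item, delta):
        """Apply delta = +1 for an insert or delta = -1 for a delete."""
        self.total += delta
        for rep in range(len(self.hashes)):
            row = self.counts[rep][self._group(rep, item)]
            row[0] += delta
            for j in range(self.bits):
                if (item >> j) & 1:
                    row[j + 1] += delta

    def hot_items(self, fraction):
        """Return candidates whose count appears to exceed fraction * total."""
        threshold = fraction * self.total
        candidates = set()
        for rep in range(len(self.hashes)):
            for g, row in enumerate(self.counts[rep]):
                if row[0] <= threshold:
                    continue  # no item in this group can be hot
                item, decodable = 0, True
                for j in range(self.bits):
                    one = row[j + 1] > threshold
                    zero = row[0] - row[j + 1] > threshold
                    if one == zero:  # both or neither side is heavy
                        decodable = False
                        break
                    if one:
                        item |= 1 << j
                # Keep the decoded item only if it hashes back to this group.
                if decodable and self._group(rep, item) == g:
                    candidates.add(item)
        return candidates


sketch = HotItemSketch()
for _ in range(100):
    sketch.update(7, +1)
    sketch.update(42, +1)
for noise in range(1000, 1020):
    sketch.update(noise, +1)
for _ in range(100):
    sketch.update(42, -1)  # deletions remove 42 from the hot set
print(sketch.hot_items(0.3))  # prints {7}
```

Because every counter is a plain sum, a deletion is just an update with a negative delta, which is why this style of summary can follow both inserts and deletes without rescanning the data. The sketch omits the probabilistic analysis that sets the number of groups and repetitions for a target error probability.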

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems

References: 33 articles.

1. Aho, A. V., Hopcroft, J. E., and Ullman, J. D. 1987. Data Structures and Algorithms. Addison-Wesley, Reading, MA.

2. Tracking join and self-join sizes in limited storage

3. The Space Complexity of Approximating the Frequency Moments

4. Distributed top-k monitoring

Cited by 151 articles.

1. Improved Lower Bound for Estimating the Number of Defective Items;Combinatorial Optimization and Applications;2023-12-09

2. Compact Frequency Estimators in Adversarial Environments;Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security;2023-11-15

3. Adaptive Group Testing on Networks With Community Structure: The Stochastic Block Model;IEEE Transactions on Information Theory;2023-07

4. Enhanced Machine Learning Sketches for Network Measurements;IEEE Transactions on Computers;2023-04-01

5. Adversarially Robust Streaming Algorithms via Differential Privacy;Journal of the ACM;2022-11-24
