Affiliation:
1. Intel Labs
2. Texas A&M University
3. Palo Alto Research Center
Abstract
We propose Graph Priority Sampling (GPS), a new paradigm for order-based reservoir sampling from massive graph streams. GPS provides a general way to weight edge sampling according to auxiliary and/or size variables so as to accomplish various estimation goals of graph properties. In the context of subgraph counting, we show how edge sampling weights can be chosen so as to minimize the estimation variance of counts of specified sets of subgraphs. In contrast to many prior graph sampling schemes, GPS separates the functions of edge sampling and subgraph estimation. We propose two estimation frameworks: (1) Post-Stream estimation, to allow GPS to construct a reference sample of edges to support retrospective graph queries, and (2) In-Stream estimation, to allow GPS to obtain lower variance estimates by incrementally updating the subgraph count estimates during stream processing. Unbiasedness of subgraph estimators is established through a new martingale formulation of graph stream order sampling, in which subgraph estimators, written as a product of constituent edge estimators, are unbiased, even when computed at different points in the stream. The separation of estimation and sampling enables significant resource savings relative to previous work. We illustrate our framework with applications to triangle and wedge counting. We perform a large-scale experimental study on real-world graphs from various domains and types. GPS achieves high accuracy with < 1% error for triangle and wedge counting, while storing a small fraction of the graph with average update times of a few microseconds per edge. Notably, for billion-scale graphs, GPS accurately estimates triangle and wedge counts with < 1% error, while storing < 0.01% of the total edges in the graph.
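
The Post-Stream idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard priority-sampling recipe (priority = weight / uniform random number, keep the m highest priorities, use the largest evicted priority as a threshold) and a Horvitz-Thompson style triangle estimate formed as a product of per-edge inverse inclusion probabilities. The names `gps_sample` and `estimate_triangles`, the default uniform `weight` function, and the assumption of a simple graph (no self-loops, no repeated edges) are all illustrative choices.

```python
import heapq
import random


def gps_sample(edges, m, weight=lambda u, v: 1.0, seed=0):
    """Keep the m highest-priority edges of a stream (illustrative sketch).

    Returns a dict mapping each retained edge (as a frozenset) to its
    estimated inclusion probability min(1, w / z_star).
    """
    rng = random.Random(seed)
    heap = []      # min-heap of (priority, u, v, weight)
    z_star = 0.0   # largest priority evicted so far (the threshold)
    for u, v in edges:
        w = weight(u, v)
        # 1 - random() lies in (0, 1], which avoids division by zero.
        priority = w / (1.0 - rng.random())
        heapq.heappush(heap, (priority, u, v, w))
        if len(heap) > m:
            evicted = heapq.heappop(heap)  # lowest-priority edge leaves
            z_star = max(z_star, evicted[0])
    probs = {}
    for _, u, v, w in heap:
        # With no eviction yet, every edge was kept with probability 1.
        p = 1.0 if z_star == 0.0 else min(1.0, w / z_star)
        probs[frozenset((u, v))] = p
    return probs


def estimate_triangles(probs):
    """Post-stream triangle estimate from the sampled edges only."""
    adj = {}
    for edge in probs:
        u, v = tuple(edge)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0.0
    for edge, p_uv in probs.items():
        u, v = tuple(edge)
        for x in adj[u] & adj[v]:
            p_ux = probs[frozenset((u, x))]
            p_vx = probs[frozenset((v, x))]
            # Product of the three constituent edge estimators.
            total += 1.0 / (p_uv * p_ux * p_vx)
    # Each sampled triangle is visited once per edge, so divide by 3.
    return total / 3.0
```

When the reservoir size m is at least the number of edges, nothing is evicted, every probability is 1, and the estimate reduces to the exact triangle count of the stream. With a smaller m, the estimate is a random quantity whose quality depends on the choice of weights, which is the design question the abstract addresses.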
Subject
General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development
Cited by
52 articles.