Affiliation:
1. The University of Sydney, Sydney, Australia
2. The Chinese University of Hong Kong, Hong Kong, China
Abstract
Counting and enumerating all occurrences of k-cliques, i.e., complete subgraphs with k vertices, in a large graph G is a fundamental problem with many applications. However, exact solutions are often infeasible due to the exponential growth in the number of k-cliques as k increases. Thus, a more practical approach is to approximately count and uniformly sample k-cliques. Turán-Shadow and DPColorPath are two state-of-the-art algorithms for approximately counting k-cliques. The general idea is to first construct a sample space that is a superset of all k-cliques in G, and then sample t elements uniformly at random (u.a.r.) from the sample space for a pre-determined t; the k-clique count is estimated as the sample space size multiplied by the ratio of k-cliques among the t samples. Although techniques have been proposed in Turán-Shadow for setting t to ensure the estimation accuracy, the theoretically chosen t is often too large to be practical. As a result, both existing algorithms use a fixed t in their implementations and thus do not offer an accuracy guarantee. In this paper, we propose the first randomized algorithm that achieves theoretical estimation accuracy and practical efficiency at the same time. Unlike the existing algorithms, we pre-determine the number s of k-clique samples that are required to achieve the estimation accuracy. Consequently, we can estimate the running time of the sampling stage (i.e., the time taken to sample s k-cliques) for a given sample space. Then, we propose to balance the time of constructing/refining the sample space and the time of the sampling stage, by stopping the refinement of the sample space once the elapsed time is comparable to the estimated time of the sampling stage. Extensive empirical studies on large real graphs show that our algorithm SR-kCCE provides an accurate k-clique count estimation and also runs efficiently. As a by-product, our algorithm can also be used for efficiently sampling a certain number of k-cliques u.a.r. from G.
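The two sampling schemes contrasted in the abstract can be illustrated with a minimal sketch. It is not the SR-kCCE implementation: the sample space here is assumed to be simply all k-vertex subsets of G (a trivial superset of the k-cliques), the function names `estimate_fixed_t` and `estimate_fixed_s` are hypothetical, and the plain ratio of hits to draws is used in both cases for simplicity. The abstract does not state how s is chosen or which estimator the paper uses, and the balancing of refinement time against sampling time is not modeled.

```python
import math
import random
from itertools import combinations


def is_clique(adj, vertices):
    """Return True if every pair of the given vertices is adjacent."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))


def estimate_fixed_t(adj, k, t, rng):
    """Fixed-budget scheme: draw t samples, scale the hit ratio by the space size."""
    nodes = sorted(adj)
    space_size = math.comb(len(nodes), k)  # toy sample space: all k-subsets
    hits = sum(is_clique(adj, rng.sample(nodes, k)) for _ in range(t))
    return space_size * hits / t


def estimate_fixed_s(adj, k, s, rng, max_draws=1_000_000):
    """Fixed-success scheme: keep drawing until s k-cliques have been seen.

    max_draws is an assumed safety cap so the loop ends on clique-free graphs.
    Returns the estimate and the number of draws that were needed.
    """
    if s < 1:
        raise ValueError("s must be at least 1")
    nodes = sorted(adj)
    space_size = math.comb(len(nodes), k)
    hits = draws = 0
    while hits < s and draws < max_draws:
        draws += 1
        hits += is_clique(adj, rng.sample(nodes, k))
    return space_size * hits / draws, draws
```

In the fixed-success scheme the number of draws adapts to how dense the sample space is in k-cliques, which is why the cost of the sampling stage can be estimated once s and the sample space are known.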
Publisher
Association for Computing Machinery (ACM)