Affiliation:
1. Tsinghua University, Beijing, China
2. The Chinese University of Hong Kong, Hong Kong, China
Abstract
Structural graph clustering (SCAN) is a classic graph clustering algorithm. A key step in SCAN is computing the structural similarity between vertices from the overlap ratio of their one-hop neighborhoods. Given two vertices u and v, existing studies consider only the case where u and v are neighbors. Consequently, the structural similarity between non-neighboring vertices in SCAN is always zero, and on weighted graphs, using only one-hop neighbors discards the edge weights. Both limitations may fail to reflect the true closeness of two vertices and may lead to low-quality clustering results.
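As an illustration of the one-hop similarity described above, the following minimal sketch assumes the cosine-style formulation commonly attributed to the original SCAN algorithm, computed over closed neighborhoods (each vertex counts as its own neighbor). The function name, the toy graph, and the choice to return zero for non-adjacent pairs are assumptions made for illustration, not details taken from this paper.

```python
import math

def structural_similarity(adj, u, v):
    """One-hop structural similarity in the style of classic SCAN (assumed form).

    adj maps each vertex to a list of its neighbors (unweighted, undirected).
    Non-adjacent pairs get 0.0, mirroring the limitation noted in the abstract.
    """
    if v not in adj[u]:
        return 0.0
    nu = set(adj[u]) | {u}  # closed neighborhood of u
    nv = set(adj[v]) | {v}  # closed neighborhood of v
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

# Hypothetical toy graph: a triangle a-b-c with a tail c-d-e.
adj = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}

print(structural_similarity(adj, "a", "b"))  # adjacent, identical closed neighborhoods
print(structural_similarity(adj, "a", "d"))  # not adjacent, so zero by construction
```

In this toy graph, a and d share the neighbor c, yet their similarity is zero because they are not adjacent, which is the behavior the distance-based definition is meant to address.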
To tackle these issues, we define and study the distance-based structural graph clustering problem. Given a distance threshold d and two vertices u and v, the structural similarity between u and v is defined as the ratio of their respective neighbors within distance at most d. We show that the newly defined distance-based SCAN achieves better clustering results than the vanilla version of SCAN. However, the new definition makes computing the final clustering results challenging. To address this efficiency issue, we propose DistanceSCAN, an efficient approximate algorithm for solving the distance-based SCAN problem. The main idea of DistanceSCAN is to use all-distances bottom-k sketches (ADS) to speed up the computation of similarities. Given the ADS, we can derive the similarity between two vertices with a bounded cost of O(k).
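The idea of estimating a neighborhood-overlap ratio from bottom-k sketches can be sketched as follows. This is not the DistanceSCAN algorithm itself: it builds a plain bottom-k sketch per d-neighborhood rather than a full all-distances sketch, and it assumes the ratio is a Jaccard-style overlap, which the abstract does not spell out. All function names and the toy weighted graph are hypothetical.

```python
import heapq
import random

def neighbors_within(adj, source, d):
    """Vertices whose shortest-path distance from source is at most d (Dijkstra)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = du + w
            if nd <= d and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return set(dist)

def bottom_k(items, rank, k):
    """Bottom-k sketch: the k smallest random ranks among the given vertices."""
    return sorted(rank[x] for x in items)[:k]

def estimate_jaccard(sk_a, sk_b, k):
    """Estimate |A & B| / |A | B| from two bottom-k sketches.

    The k smallest ranks of the merged sketches form a bottom-k sketch of the
    union; the fraction of them present in both sketches estimates the ratio.
    sorted() is used for brevity; a linear merge of the two already-sorted
    sketches would bring the cost down to O(k).
    """
    merged = sorted(set(sk_a) | set(sk_b))[:k]
    if not merged:
        return 0.0
    sa, sb = set(sk_a), set(sk_b)
    return sum(1 for r in merged if r in sa and r in sb) / len(merged)

# Hypothetical weighted graph: vertex -> list of (neighbor, edge weight).
adj = {
    "a": [("b", 1.0), ("c", 2.5)],
    "b": [("a", 1.0), ("c", 1.0)],
    "c": [("a", 2.5), ("b", 1.0), ("d", 1.0)],
    "d": [("c", 1.0)],
}

rng = random.Random(0)
rank = {v: rng.random() for v in adj}  # one shared random rank per vertex

near_a = neighbors_within(adj, "a", 2.0)
near_d = neighbors_within(adj, "d", 2.0)
k = 8
print(estimate_jaccard(bottom_k(near_a, rank, k), bottom_k(near_d, rank, k), k))
```

When k is at least the size of both neighborhoods, as in this toy case, the sketches contain every vertex and the estimate coincides with the exact ratio; for large neighborhoods, a smaller k trades accuracy for speed.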
However, to ensure that the estimated similarity has an approximation guarantee, the value of k still needs to be set as large as several thousand. This incurs high computational costs when computing the similarities between neighboring vertices. To tackle this issue, we further construct histograms to prune the structural similarity computations of vertex pairs. Extensive experiments on real datasets validate the effectiveness and efficiency of DistanceSCAN.
Funder
Hong Kong RGC GRF Grant
State Key Laboratory of Computer Architecture
Hong Kong RGC ECS Grant
Hong Kong RGC CRF Grant
National Key R&D Program of China
Hong Kong ITC ITF Grant
National Natural Science Foundation of China
Publisher
Association for Computing Machinery (ACM)
Cited by
4 articles.