Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs

Published: 2022-01-11 Issue: 4 Volume: 40 Page: 1-45
ISSN: 1046-8188
Container-title: ACM Transactions on Information Systems
Language: en
Short-container-title: ACM Trans. Inf. Syst.

Author:

Yu Weiren¹, McCann Julie², Zhang Chengyuan³, Ferhatosmanoglu Hakan¹

Affiliation:

1. University of Warwick, Coventry, UK

2. Imperial College, London, UK

3. Hunan University, Changsha, China

Abstract

SimRank is an attractive link-based similarity measure used in fertile fields of Web search and sociometry. However, the existing deterministic method by Kusumoto et al. [24] for retrieving SimRank does not always produce high-quality similarity results, as it fails to accurately obtain the diagonal correction matrix D. Moreover, SimRank has a “connectivity trait” problem: increasing the number of paths between a pair of nodes would decrease its similarity score. The best-known remedy, SimRank++ [1], cannot completely fix this problem, since its score would still be zero if there are no common in-neighbors between two nodes. In this article, we study fast high-quality link-based similarity search on billion-scale graphs. (1) We first devise a “varied-D” method to accurately compute SimRank in linear memory. We also aggregate duplicate computations, which reduces the time of [24] from quadratic to linear in the number of iterations. (2) We propose a novel “cosine-based” SimRank model to circumvent the “connectivity trait” problem. (3) To substantially speed up the partial-pairs “cosine-based” SimRank search on large graphs, we devise an efficient dimensionality reduction algorithm, PSR#, with guaranteed accuracy. (4) We give mathematical insights into the semantic difference between SimRank and its variant, and correct an argument in [24] that “if D is replaced by a scaled identity matrix (1-γ)I, their top-K rankings will not be affected much”. (5) We propose a novel method that can accurately convert from Li et al.'s SimRank S̃ to Jeh and Widom’s SimRank S. (6) We propose GSR#, a generalisation of our “cosine-based” SimRank model, to quantify pairwise similarities across two distinct graphs, unlike SimRank, which would assess nodes across two graphs as completely dissimilar. Extensive experiments on various datasets demonstrate the superiority of our proposed approaches in terms of high search quality, computational efficiency, accuracy, and scalability on billion-edge graphs.
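
For orientation, the two SimRank forms contrasted in points (4) and (5) can be sketched on a toy graph. This is only the textbook fixed-point iteration for Jeh and Widom's SimRank S (diagonal pinned to 1) and for the Li et al. form S̃ (diagonal correction replaced by the scaled identity (1-C)I). It is not the article's varied-D, PSR#, or GSR# method. The function names, the damping factor C = 0.6, the iteration count, and the three-node graph are illustrative assumptions.

```python
def in_neighbors(n, edges):
    """Return, for each node v, the list of nodes u with an edge u -> v."""
    nbrs = [[] for _ in range(n)]
    for u, v in edges:
        nbrs[v].append(u)
    return nbrs


def avg_pair_score(S, Ia, Ib):
    """Average of S[i][j] over all in-neighbor pairs; 0 if either set is empty."""
    if not Ia or not Ib:
        return 0.0
    return sum(S[i][j] for i in Ia for j in Ib) / (len(Ia) * len(Ib))


def simrank_jeh_widom(n, edges, c=0.6, iters=10):
    """Jeh and Widom's form: s(a, a) = 1, off-diagonal = c * average over in-neighbor pairs."""
    I = in_neighbors(n, edges)
    S = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    for _ in range(iters):
        S = [[1.0 if a == b else c * avg_pair_score(S, I[a], I[b])
              for b in range(n)] for a in range(n)]
    return S


def simrank_li(n, edges, c=0.6, iters=10):
    """Li et al. form: S = c * (averaged in-neighbor scores) + (1 - c) * identity."""
    I = in_neighbors(n, edges)
    S = [[(1.0 - c) if a == b else 0.0 for b in range(n)] for a in range(n)]
    for _ in range(iters):
        S = [[c * avg_pair_score(S, I[a], I[b]) + ((1.0 - c) if a == b else 0.0)
              for b in range(n)] for a in range(n)]
    return S


if __name__ == "__main__":
    # Hypothetical toy graph: node 0 points to nodes 1 and 2.
    edges = [(0, 1), (0, 2)]
    print(simrank_jeh_widom(3, edges)[1][2])
    print(simrank_li(3, edges)[1][2])
```

On this toy graph, nodes 1 and 2 share the single in-neighbor 0, so the first form scores the pair at about 0.6 and the second at about 0.24. The two models therefore assign different values to the same pair, which is the kind of discrepancy the conversion in point (5) addresses.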

Funder

National Natural Science Foundation of China

Natural Science Foundation of Jiangsu Province

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Science Applications; General Business, Management and Accounting; Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/3495209

Reference: 59 articles.

1. SimRank++: Query rewriting through link analysis of the click graph; Antonellis Ioannis; PVLDB, 2008

2. Peter Christen. 2012. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer. DOI:https://doi.org/10.1007/978-3-642-31164-2

3. Data integration using similarity joins and a word-based information representation language

4. Random walks on the click graph

5. P-Simrank: Extending Simrank to Scale-Free Bipartite Networks

Cited by 3 articles.

1. A Multi-Type Transferable Method for Missing Link Prediction in Heterogeneous Social Networks; IEEE Transactions on Knowledge and Data Engineering; 2023-11-01

2. SimSky: An Accuracy-Aware Algorithm for Single-Source SimRank Search; Machine Learning and Knowledge Discovery in Databases: Research Track; 2023

3. CoSimHeat: An Effective Heat Kernel Similarity Measure Based on Billion-Scale Network Topology; Proceedings of the ACM Web Conference 2022; 2022-04-25