Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs

Author:

Yu Weiren (1), McCann Julie (2), Zhang Chengyuan (3), Ferhatosmanoglu Hakan (1)

Affiliation:

1. University of Warwick, Coventry, UK

2. Imperial College, London, UK

3. Hunan University, Changsha, China

Abstract

SimRank is an attractive link-based similarity measure used in the fertile fields of Web search and sociometry. However, the existing deterministic method by Kusumoto et al. [24] for retrieving SimRank does not always produce high-quality similarity results, as it fails to accurately obtain the diagonal correction matrix D. Moreover, SimRank has a “connectivity trait” problem: increasing the number of paths between a pair of nodes would decrease its similarity score. The best-known remedy, SimRank++ [1], cannot completely fix this problem, since its score would still be zero if there are no common in-neighbors between two nodes. In this article, we study fast high-quality link-based similarity search on billion-scale graphs. (1) We first devise a “varied-D” method to accurately compute SimRank in linear memory. We also aggregate duplicate computations, which reduces the time of [24] from quadratic to linear in the number of iterations. (2) We propose a novel “cosine-based” SimRank model to circumvent the “connectivity trait” problem. (3) To substantially speed up the partial-pairs “cosine-based” SimRank search on large graphs, we devise an efficient dimensionality reduction algorithm, PSR#, with guaranteed accuracy. (4) We give mathematical insights into the semantic difference between SimRank and its variant, and correct an argument in [24] that “if D is replaced by a scaled identity matrix (1-γ)I, their top-K rankings will not be affected much”. (5) We propose a novel method that can accurately convert from Li et al.'s SimRank S̃ to Jeh and Widom’s SimRank S. (6) We propose GSR#, a generalisation of our “cosine-based” SimRank model, to quantify pairwise similarities across two distinct graphs, unlike SimRank, which would assess nodes across two graphs as completely dissimilar.
Extensive experiments on various datasets demonstrate the superiority of our proposed approaches in terms of high search quality, computational efficiency, accuracy, and scalability on billion-edge graphs.
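
The “connectivity trait” problem mentioned above can be illustrated with a minimal sketch of the classic Jeh and Widom SimRank recurrence. This is not the method proposed in the article: it is a naive all-pairs iteration, and the decay factor c = 0.8, the iteration count, and the helper names `simrank` and `shared_parents_graph` are assumptions made only for illustration. The toy graph gives nodes a and b the same k in-neighbors, each of which has no in-neighbors of its own.

```python
from itertools import product


def simrank(in_nbrs, c=0.8, iters=10):
    """Naive Jeh-Widom SimRank iteration (illustrative, not the article's algorithm).

    in_nbrs maps each node to the list of its in-neighbors.
    """
    nodes = list(in_nbrs)
    # Start from the identity: every node is fully similar to itself only.
    s = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iters):
        nxt = {}
        for a, b in product(nodes, nodes):
            if a == b:
                nxt[(a, b)] = 1.0
            elif not in_nbrs[a] or not in_nbrs[b]:
                # A node without in-neighbors is dissimilar to every other node.
                nxt[(a, b)] = 0.0
            else:
                # Average the similarity over all pairs of in-neighbors, damped by c.
                total = sum(s[(i, j)] for i in in_nbrs[a] for j in in_nbrs[b])
                nxt[(a, b)] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        s = nxt
    return s


def shared_parents_graph(k):
    """Hypothetical toy graph: 'a' and 'b' share exactly the k in-neighbors p0..p(k-1)."""
    parents = [f"p{i}" for i in range(k)]
    graph = {p: [] for p in parents}
    graph["a"] = list(parents)
    graph["b"] = list(parents)
    return graph


# Similarity of a and b as the number of shared in-neighbors grows.
scores = {k: simrank(shared_parents_graph(k))[("a", "b")] for k in (1, 2, 4)}
print(scores)
```

In this toy setting the score of (a, b) works out to c/k, so adding more common in-neighbors, and hence more connecting paths, lowers the similarity rather than raising it. This is the counterintuitive behaviour that the “cosine-based” model in the abstract is designed to avoid.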

Funder

National Natural Science Foundation of China

Natural Science Foundation of Jiangsu Province

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Science Applications; General Business, Management and Accounting; Information Systems

References (59 articles)

1. SimRank++: Query rewriting through link analysis of the click graph;Antonellis Ioannis;PVLDB,2008

2. Peter Christen. 2012. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer. DOI:https://doi.org/10.1007/978-3-642-31164-2

3. Data integration using similarity joins and a word-based information representation language

4. Random walks on the click graph

5. P-Simrank: Extending Simrank to Scale-Free Bipartite Networks

Cited by 3 articles.

1. A Multi-Type Transferable Method for Missing Link Prediction in Heterogeneous Social Networks;IEEE Transactions on Knowledge and Data Engineering;2023-11-01

2. SimSky: An Accuracy-Aware Algorithm for Single-Source SimRank Search;Machine Learning and Knowledge Discovery in Databases: Research Track;2023

3. CoSimHeat: An Effective Heat Kernel Similarity Measure Based on Billion-Scale Network Topology;Proceedings of the ACM Web Conference 2022;2022-04-25
