Affiliation:
1. University of Illinois at Urbana-Champaign, Urbana, IL
2. Microsoft Research Asia, Beijing, China
Abstract
Link-based ranking algorithms are among the most important techniques to improve web search. In particular, the PageRank algorithm has been successfully used in the Google search engine and has been attracting much attention recently. However, we find that PageRank has a “zero-one gap” problem which, to the best of our knowledge, has not been addressed in any previous work. This problem can be potentially exploited to spam PageRank results and make the state-of-the-art link-based antispamming techniques ineffective. The zero-one gap problem arises as a result of the current ad hoc way of computing transition probabilities in the random surfing model. We therefore propose a novel
DirichletRank
algorithm which calculates these probabilities using Bayesian estimation with a Dirichlet prior. DirichletRank is a variant of PageRank, but does not have the problem of zero-one gap and can be analytically shown substantially more resistant to some link spams than PageRank. Experiment results on TREC data show that DirichletRank can achieve better retrieval accuracy than PageRank due to its more reasonable allocation of transition probabilities. More importantly, experiments on the TREC dataset and another real web dataset from the Webgraph project show that, compared with the original PageRank, DirichletRank is more stable under link perturbation and is significantly more robust against both manually identified web spams and several simulated link spams. DirichletRank can be computed as efficiently as PageRank, and thus is scalable to large-scale web applications.
Funder
Division of Information and Intelligent Systems
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications,General Business, Management and Accounting,Information Systems
Cited by
36 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献