Affiliation:
1. University of Copenhagen, Denmark
Abstract
Several properties of information retrieval (IR) data, such as query frequency or document length, are widely considered to be approximately distributed as a power law. This assumption is typically invoked to highlight specific characteristics of the empirical probability distribution of such data (e.g., its scale-free nature or its long/fat tail). It may not, however, always hold. Motivated by recent work in the statistical treatment of power law claims, we investigate two research questions: (i) To what extent do power law approximations hold for term frequency, document length, query frequency, query length, citation frequency, and syntactic unigram frequency? And (ii) what is the computational cost of replacing ad hoc power law approximations with more accurate distribution fitting? We study 23 TREC and 5 non-TREC datasets and compare the fit of power laws to 15 other standard probability distributions. We find that query frequency and 5 out of 24 term frequency distributions are best approximated by a power law. All remaining properties are better approximated by the Inverse Gaussian, Generalized Extreme Value, Negative Binomial, or Yule distribution. We also find the overhead of replacing power law approximations with more informed distribution fitting to be negligible, with potential gains for IR tasks such as index compression or test collection generation for IR evaluation.
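The comparison the abstract describes, fitting a power law and an alternative distribution to the same data and asking which fits better, can be illustrated with a minimal sketch. This is not the paper's procedure: the exponential alternative, the synthetic data, the fixed `xmin`, and the function names `fit_power_law` and `fit_exponential` are all assumptions made for illustration. The paper itself compares against 15 other distributions on real datasets.

```python
import math
import random


def fit_power_law(xs, xmin):
    """MLE for a continuous power law p(x) = ((a-1)/xmin) * (x/xmin)**(-a), x >= xmin.

    Returns (alpha_hat, log_likelihood). xmin is assumed known here.
    """
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    alpha = 1.0 + n / log_sum
    loglik = n * math.log((alpha - 1.0) / xmin) - alpha * log_sum
    return alpha, loglik


def fit_exponential(xs, xmin):
    """MLE for a shifted exponential p(x) = lam * exp(-lam * (x - xmin)), x >= xmin.

    Returns (lambda_hat, log_likelihood). Illustrative alternative only.
    """
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    excess = sum(x - xmin for x in tail)
    lam = n / excess
    loglik = n * math.log(lam) - lam * excess
    return lam, loglik


# Synthetic heavy-tailed sample (hypothetical data, not from the paper),
# drawn by inverse-transform sampling from a power law with exponent 2.5.
random.seed(0)
true_alpha, xmin = 2.5, 1.0
data = [xmin * (1.0 - random.random()) ** (-1.0 / (true_alpha - 1.0))
        for _ in range(5000)]

alpha_hat, ll_power = fit_power_law(data, xmin)
lam_hat, ll_exp = fit_exponential(data, xmin)

# Both models have one free parameter, so comparing log-likelihoods
# is equivalent to comparing AIC values.
print("alpha_hat:", round(alpha_hat, 3))
print("log-likelihood (power law):", round(ll_power, 1))
print("log-likelihood (exponential):", round(ll_exp, 1))
print("better fit:", "power law" if ll_power > ll_exp else "exponential")
```

On real IR data, `xmin` would also have to be estimated, and discrete quantities such as term counts would call for discrete distributions. Both steps are left out of this sketch.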
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications; General Business, Management and Accounting; Information Systems
Cited by
36 articles.
1. HyGate-GCN: Hybrid-Gate-Based Graph Convolutional Networks with dynamical ratings estimation for personalized POI recommendation;Expert Systems with Applications;2024-12
2. What Matters in a Measure? A Perspective from Large-Scale Search Evaluation;Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval;2024-07-10
3. Oblique Logistic Function for the Rank-Frequency Distribution of Letters;2023 4th International Informatics and Software Engineering Conference (IISEC);2023-12-21
4. Measuring Service-Level Learning Effects in Search Via Query-Randomized Experiments;Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval;2023-07-18
5. A Two-Level Signature Scheme for Stable Set Similarity Joins;Proceedings of the VLDB Endowment;2023-07