Sampling a Near Neighbor in High Dimensions — Who is the Fairest of Them All?

Published: 2022-03-31 Issue: 1 Volume: 47 Page: 1-40
ISSN: 0362-5915
Container-title: ACM Transactions on Database Systems
Language: en
Short-container-title: ACM Trans. Database Syst.

Author:

Aumüller Martin¹, Har-Peled Sariel², Mahabadi Sepideh³, Pagh Rasmus⁴, Silvestri Francesco⁵

Affiliation:

1. IT University of Copenhagen, København S, Denmark

2. University of Illinois at Urbana-Champaign, Urbana, IL, USA

3. Toyota Technological Institute at Chicago, Chicago, IL, USA

4. BARC and University of Copenhagen, København Ø, Denmark

5. University of Padova, Padova, Italy

Abstract

Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points S and a radius parameter r > 0, the r-near neighbor (r-NN) problem asks for a data structure that, given any query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance r from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. In this work, we show that LSH-based algorithms can be made fair, without a significant loss in efficiency. We propose several efficient data structures for the exact and approximate variants of the fair NN problem. Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. We also develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights the unfairness of state-of-the-art NN data structures and shows the performance of our algorithms on real-world datasets.
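To make the fairness requirement in the abstract concrete, the following is a minimal brute-force sketch of the problem statement only. It is not one of the paper's LSH-based data structures: it scans every point, so it takes linear time per query. The function name `fair_near_neighbor`, the use of Euclidean distance, and the tuple representation of points are assumptions made for illustration.

```python
import math
import random
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]


def fair_near_neighbor(
    points: Sequence[Point],
    q: Point,
    r: float,
    rng: Optional[random.Random] = None,
) -> Optional[Point]:
    """Return a uniformly random point of `points` within distance r of q.

    Hypothetical baseline: a linear scan combined with size-1 reservoir
    sampling, so every point in the r-ball around q is returned with the
    same probability. Returns None if no point is within distance r.
    """
    rng = rng or random.Random()
    chosen = None
    seen = 0  # number of near neighbors encountered so far
    for p in points:
        if math.dist(p, q) <= r:  # Euclidean distance (an assumption)
            seen += 1
            # Replace the current choice with probability 1/seen, which
            # keeps the choice uniform over all near neighbors seen so far.
            if rng.randrange(seen) == 0:
                chosen = p
    return chosen


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
    print(fair_near_neighbor(data, q=(0.0, 0.0), r=1.5))
```

A structure that always returned the first near neighbor it found would still solve the plain r-NN problem, but it would not be fair in the sense above. The reservoir step is what equalizes the return probabilities in this sketch.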

Funder

NSF AF award

UniPD

PRIN

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/3502867

Cited by 2 articles.

1. Efficient Dynamic Weighted Set Sampling and Its Extension;Proceedings of the VLDB Endowment;2023-09

2. Simpler is Much Faster: Fair and Independent Inner Product Search;Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval;2023-07-18