A seven-dimensional analysis of hashing methods and its implications on query processing

Published:2015-11 Issue:3 Volume:9 Page:96-107
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
Language:en
Short-container-title:Proc. VLDB Endow.

Author:

Richter Stefan¹,Alvarez Victor²,Dittrich Jens¹

Affiliation:

1. Saarland University

2. TU Braunschweig

Abstract

Hashing is a solved problem. It allows us to get constant time access for lookups. Hashing is also simple. It is safe to use an arbitrary method as a black box and expect good performance, and optimizations to hashing can only improve it by a negligible delta. Why are all of the previous statements plain wrong? That is what this paper is about. In this paper we thoroughly study hashing for integer keys and carefully analyze the most common hashing methods in a five-dimensional requirements space: (1) data-distribution, (2) load factor, (3) dataset size, (4) read/write-ratio, and (5) un/successful-ratio. Each point in that design space may potentially suggest a different hashing scheme, and additionally also a different hash function. We show that a right or wrong decision in picking the right hashing scheme and hash function combination may lead to significant difference in performance. To substantiate this claim, we carefully analyze two additional dimensions: (6) five representative hashing schemes (which includes an improved variant of Robin Hood hashing), (7) four important classes of hash functions widely used today. That is, we consider 20 different combinations in total. Finally, we also provide a glimpse about the effect of table memory layout and the use of SIMD instructions. Our study clearly indicates that picking the right combination may have considerable impact on insert and lookup performance, as well as memory footprint. A major conclusion of our work is that hashing should be considered a white box before blindly using it in applications, such as query processing. Finally, we also provide a strong guideline about when to use which hashing method.

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/2850583.2850585

Cited by 35 articles.

1. High-Performance Sorting-Based K-mer Counting in Distributed Memory with Flexible Hybrid Parallelism;Proceedings of the 53rd International Conference on Parallel Processing;2024-08-12

2. Simple, Efficient, and Robust Hash Tables for Join Processing;Proceedings of the 20th International Workshop on Data Management on New Hardware;2024-06-09

3. Differentiating Set Intersections in Maximal Clique Enumeration by Function and Subproblem Size;Proceedings of the 38th ACM International Conference on Supercomputing;2024-05-30

4. Two-Way Linear Probing Revisited;Algorithms;2023-10-28

5. Analyzing Vectorized Hash Tables across CPU Architectures;Proceedings of the VLDB Endowment;2023-07