Abstract
Mergesort is a popular algorithm for sorting real-world workloads as it is immune to data skewness, suitable for parallelization using vectorized intrinsics, and relatively simple to multi-thread. In this paper, we introduce Origami, an in-memory merge-sort framework that is optimized for scalar, as well as all current SIMD (single-instruction multiple-data) CPU architectures. For each vector-extension set (e.g., SSE, AVX2, AVX-512), we present an in-register sorter for small sequences that is up to 8× faster than prior methods and a branchless streaming merger that achieves up to a 1.5× speed-up over the naive merge. In addition, we introduce a cache-residing quad-merge tree to avoid bottlenecking on memory bandwidth and a parallel partitioning scheme to maximize thread-level concurrency. We develop an end-to-end sort with these components and produce a highly utilized mergesort pipeline by reducing the synchronization overhead between threads. Single-threaded Origami performs up to 2× faster than the closest competitor and achieves a nearly perfect speed-up in multi-core environments.
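As a rough illustration of two ideas named in the abstract, the following scalar C++ sketch shows a four-element sorting network built from min/max compare-exchange steps and a two-way merge whose element selection avoids a data-dependent branch. It is not taken from the Origami source: the names `sort4`, `compare_exchange`, and `branchless_merge` are hypothetical, the key type `std::uint32_t` is an assumption, and the SIMD in-register sorter and streaming merger described in the paper are considerably more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Compare-exchange: afterwards lo <= hi. Written with min/max so that a
// compiler can typically avoid a conditional jump (not guaranteed).
inline void compare_exchange(std::uint32_t& lo, std::uint32_t& hi) {
    const std::uint32_t a = std::min(lo, hi);
    const std::uint32_t b = std::max(lo, hi);
    lo = a;
    hi = b;
}

// Hypothetical helper: sorts exactly four keys with the standard
// five-comparator network (0,1) (2,3) (0,2) (1,3) (1,2).
inline void sort4(std::uint32_t* v) {
    compare_exchange(v[0], v[1]);
    compare_exchange(v[2], v[3]);
    compare_exchange(v[0], v[2]);
    compare_exchange(v[1], v[3]);
    compare_exchange(v[1], v[2]);
}

// Hypothetical helper: merges two sorted vectors. The choice between the
// two head elements is done with a bit mask instead of an if/else, so the
// only branches left in the main loop are the loop bounds checks.
inline std::vector<std::uint32_t> branchless_merge(
        const std::vector<std::uint32_t>& a,
        const std::vector<std::uint32_t>& b) {
    std::vector<std::uint32_t> out(a.size() + b.size());
    std::size_t i = 0, j = 0, k = 0;
    while (i < a.size() && j < b.size()) {
        const std::uint32_t x = a[i];
        const std::uint32_t y = b[j];
        const std::uint32_t take_a = (x <= y) ? 1u : 0u;  // 1 if a's head wins
        const std::uint32_t mask = 0u - take_a;           // all ones or all zeros
        out[k++] = y ^ ((x ^ y) & mask);                  // x if take_a, else y
        i += take_a;
        j += 1u - take_a;
    }
    // Copy whichever input still has elements left.
    while (i < a.size()) out[k++] = a[i++];
    while (j < b.size()) out[k++] = b[j++];
    return out;
}
```

A caller might sort small fixed-size groups with `sort4` and then combine the sorted runs pairwise with `branchless_merge`; whether the compiler actually emits branch-free machine code depends on the target and optimization level.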
Publisher
Association for Computing Machinery (ACM)
Subject
General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development