A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE)

Published: 2021-11-19 Volume: 7 Page: e769
ISSN: 2376-5992
Container-title: PeerJ Computer Science
Language: en

Author:

Bramas Bérenger¹²

Affiliation:

1. CAMUS, Inria Nancy - Grand Est, Nancy, France

2. ICPS Team, ICube, Illkirch-Graffenstaden, France

Abstract

How developers implement their algorithms, and how those implementations behave on modern CPUs, are governed by the design and organization of the CPUs themselves. The vectorization units (SIMD) are among the few parts of a CPU that can and must be explicitly controlled. In the HPC community, x86 CPUs and their vectorization instruction sets were the de facto standard for decades. Each new release of an instruction set usually doubled the vector length and added new operations, and each generation pushed developers to adapt and improve previous implementations. The release of the ARM scalable vector extension (SVE) changed things radically for several reasons. First, we expect ARM processors to equip many supercomputers in the coming years. Second, SVE’s interface differs from the x86 extensions in several respects: it provides different instructions, uses a predicate to control most operations, and has a vector size that is known only at execution time. Therefore, using SVE raises new challenges in adapting algorithms, including ones that are already well optimized on x86. In this paper, we port a hybrid sort based on the well-known Quicksort and Bitonic-sort algorithms. We use a Bitonic sort to process small partitions/arrays and a vectorized partitioning implementation to divide the partitions. We explain how we use the predicates and how we manage the non-static vector size. We also explain how we efficiently implement the sorting kernels. Our approach only needs an array of size O(log N) for the recursive calls in the partitioning phase, both in the sequential and in the parallel case. We test the performance of our approach on a modern ARMv8.2 (A64FX) CPU and assess the different layers of our implementation by sorting/partitioning integers, double-precision floating-point numbers, and key/value pairs of integers. Our results show that our approach is on average 4 times faster than the GNU C++ sort algorithm.
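The hybrid structure described in the abstract can be illustrated with a minimal scalar sketch. It is not the paper's SVE implementation: it uses no vector instructions or predicates, and the names `kSmall`, `bitonicSortSmall`, `partitionRange`, and `hybridSort`, as well as the threshold of 16, are assumptions made for illustration only. It shows the overall shape: Quicksort-style partitioning for large ranges, a bitonic sorting network for small ones, and an explicit stack that stays at O(log N) entries because the larger side of each split is always deferred.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical threshold: ranges of at most kSmall elements use the network.
constexpr std::size_t kSmall = 16;

// Scalar bitonic sorting network over a fixed power-of-two buffer.
// Unused slots are padded with the maximum int so they sink to the end.
void bitonicSortSmall(int* data, std::size_t n) {
    assert(n <= kSmall);
    int buf[kSmall];
    std::fill(buf, buf + kSmall, std::numeric_limits<int>::max());
    std::copy(data, data + n, buf);
    for (std::size_t k = 2; k <= kSmall; k <<= 1) {          // size of bitonic blocks
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {       // compare distance
            for (std::size_t i = 0; i < kSmall; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    if ((buf[i] > buf[partner]) == ascending) {
                        std::swap(buf[i], buf[partner]);
                    }
                }
            }
        }
    }
    std::copy(buf, buf + n, data);
}

// Lomuto partition of the half-open range [lo, hi) around the middle element.
// Returns the final index of the pivot.
std::size_t partitionRange(int* data, std::size_t lo, std::size_t hi) {
    const std::size_t last = hi - 1;
    std::swap(data[lo + (hi - lo) / 2], data[last]);
    const int pivot = data[last];
    std::size_t store = lo;
    for (std::size_t i = lo; i < last; ++i) {
        if (data[i] < pivot) {
            std::swap(data[i], data[store]);
            ++store;
        }
    }
    std::swap(data[store], data[last]);
    return store;
}

// Hybrid sort: partition while the range is large, then finish with the network.
void hybridSort(std::vector<int>& v) {
    if (v.size() < 2) return;
    // Explicit stack of pending ranges; deferring the larger side bounds its depth.
    std::vector<std::pair<std::size_t, std::size_t>> pending;
    pending.emplace_back(0, v.size());
    while (!pending.empty()) {
        std::size_t lo = pending.back().first;
        std::size_t hi = pending.back().second;
        pending.pop_back();
        while (hi - lo > kSmall) {
            const std::size_t p = partitionRange(v.data(), lo, hi);
            if (p - lo < hi - (p + 1)) {
                pending.emplace_back(p + 1, hi);  // defer the larger right side
                hi = p;
            } else {
                pending.emplace_back(lo, p);      // defer the larger left side
                lo = p + 1;
            }
        }
        bitonicSortSmall(v.data() + lo, hi - lo);
    }
}
```

Calling `hybridSort(v)` on a `std::vector<int>` sorts it in ascending order. In an SVE version, the partition loop and the compare-exchange steps of the network would operate on whole vectors under predicate control, with the vector length queried at run time; the sketch above leaves that part out.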

Publisher

PeerJ

Subject

General Computer Science

Link

https://peerj.com/articles/cs-769.pdf

References: 35 articles.

1. ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX;Alappat,2021

2. Optimization of x265 encoder using ARM SVE;Aoki

3. ARM Architecture Reference Manual Supplement, The Scalable Vector Extension (SVE), for ARMv8-A (version Beta);ARM,2020

4. ARM C Language Extensions for SVE (version 00bet1);ARM,2020

5. Sorting networks and their applications;Batcher,1968

Cited by 7 articles.

1. SPC5: An efficient SpMV framework vectorized using ARM SVE and x86 AVX-512;Computer Science and Information Systems;2024

2. Efficient Large Integer Multiplication with Arm SVE Instructions;Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region;2023-02-27

3. Acceleration of Particle Swarm Optimization with AVX Instructions;Applied Sciences;2023-01-04

4. Performance Evaluation of Parallel Sortings on the Supercomputer Fugaku;Journal of Information Processing;2023

5. A one-for-all and o(v log(v))-cost solution for parallel merge style operations on sorted key-value arrays;Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems;2022-02-22