MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework

Published:2023-07-22 Issue:3 Volume:20 Page:1-23
ISSN:1544-3566
Container-title:ACM Transactions on Architecture and Code Optimization
language:en
Short-container-title:ACM Trans. Archit. Code Optim.

Author:

Zhao Yuwen¹, Liu Fangfang², Ma Wenjing², Li Huiyuan², Peng Yuanchi¹, Wang Cui³

Affiliation:

1. Institute of Software, Chinese Academy of Sciences, China and University of Chinese Academy of Sciences, China

2. Institute of Software, Chinese Academy of Sciences, China and State Key Laboratory of Computer Science, China

3. Institute of Software, Chinese Academy of Sciences, China

Abstract

Fast Fourier transform (FFT) is widely used in computing applications in large-scale parallel programs. Data communication is the main performance bottleneck of FFT and seriously affects its parallel efficiency. To tackle this problem, we propose a new large-scale FFT framework, MFFT, which optimizes parallel FFT with a new mixed-precision optimization technique, adopting the “high precision computation, low precision communication” strategy. To enable “low precision communication”, we propose a shared-exponent floating-point number compression technique, which reduces the volume of data communication while maintaining higher accuracy. In addition, we apply a two-phase normalization technique to further reduce the round-off error. Based on the mixed-precision MFFT framework, we apply several optimization techniques to improve performance, such as streaming of GPU kernels, MPI message combination, kernel optimization, and memory optimization. We evaluate MFFT on a system with 4,096 GPUs. The results show that shared-exponent MFFT is 1.23× faster than double-precision MFFT on average, and that double-precision MFFT achieves, on average, 3.53× and 9.48× higher performance than the open-source libraries 2Decomp&FFT (CPU-based version) and heFFTe (AMD GPU-based version), respectively. The parallel efficiency of double-precision MFFT increases from 53.2% to 78.1% compared with 2Decomp&FFT, and shared-exponent MFFT further increases the parallel efficiency to 83.8%.
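The abstract does not spell out the compression format, so the following is only a minimal sketch of the general shared-exponent (block floating-point) idea, not the authors' implementation. The function names `compress_block` and `decompress_block`, the block layout, and the 23-bit mantissa width are all assumptions made for illustration. Each block of values is stored as one common exponent plus a short signed integer mantissa per value, so fewer bits per value would need to be communicated than with full double precision.

```python
import math


def compress_block(values, mantissa_bits=23):
    """Quantize a block of floats to integer mantissas sharing one exponent.

    Hypothetical helper: the bit width and layout are illustrative
    assumptions, not the format used by MFFT.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # frexp gives max_abs = m * 2**shared_exp with 0.5 <= m < 1.
    _, shared_exp = math.frexp(max_abs)
    # Scale so the largest magnitude fits in mantissa_bits bits plus a sign.
    scale = mantissa_bits - shared_exp
    limit = (1 << mantissa_bits) - 1
    mantissas = []
    for v in values:
        q = round(math.ldexp(v, scale))
        # Clamp in case rounding pushes the largest value to 2**mantissa_bits.
        mantissas.append(max(-limit, min(limit, q)))
    return shared_exp, mantissas


def decompress_block(shared_exp, mantissas, mantissa_bits=23):
    """Rebuild approximate floats from the shared exponent and mantissas."""
    scale = shared_exp - mantissa_bits
    return [math.ldexp(float(m), scale) for m in mantissas]


block = [1.5, -0.25, 3.0e-3, 100.0]
exp, ints = compress_block(block)
restored = decompress_block(exp, ints)
```

In this sketch the absolute error of every restored value is bounded by `2 ** (shared_exp - mantissa_bits)`, so values much smaller than the largest entry in a block lose relative precision. That trade-off is inherent to sharing one exponent across a block.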

Funder

National Key R&D Program of China

GHfund D

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3605148

References: 52 articles.

1. 2018. NVIDIA APEX. https://github.com/NVIDIA/apex.

2. 2019. CUFFT library. https://docs.nvidia.com/pdf/CUFFT_Library.pdf.

3. 2021. rocFFT Documentation. https://rocfft.readthedocs.io/en/rocm-4.2.0/.

4. 2022. heFFTe. https://bitbucket.org/icl/heffte.

5. 2022. Large-scale atomic/molecular massively parallel simulator. https://lammps.sandia.gov/.

Cited by 3 articles.

1. An Optimized GPU Implementation for GIST Descriptor;ACM Transactions on Architecture and Code Optimization;2024-08-23

2. EA4RCA: Efficient AIE accelerator design framework for regular Communication-Avoiding Algorithm;ACM Transactions on Architecture and Code Optimization;2024-07-15

3. Research on High-Performance Fourier Transform Algorithms Based on the NPU;Applied Sciences;2024-01-01