Optimizing the Linear Fascicle Evaluation Algorithm for Multi-core and Many-core Systems

Authors:

Karan Aggarwal1, Uday Bondhugula1

Affiliation:

1. Indian Institute of Science, India

Abstract

Sparse matrix-vector multiplication (SpMV) operations are commonly used in various scientific and engineering applications. The performance of the SpMV operation often depends on exploiting regularity patterns in the matrix. Various representations and optimization techniques have been proposed to minimize the memory bandwidth bottleneck arising from the irregular memory access pattern involved. Among recent representation techniques, tensor decomposition is a popular one used for very large but sparse matrices. After sparse tensor decomposition, the new representation involves indirect accesses, making it challenging to optimize for multi-core systems and even more demanding for massively parallel architectures such as GPUs. Computational neuroscience algorithms often involve sparse datasets while performing long-running computations on them. The Linear Fascicle Evaluation (LiFE) application is a popular neuroscience algorithm used for pruning brain connectivity graphs. The datasets employed here involve the Sparse Tucker Decomposition (STD), a widely used tensor decomposition method. Using this decomposition leads to multiple indirect array references, making the computation very difficult to optimize on both multi-core and many-core systems. Recent implementations of the LiFE algorithm show that its SpMV operations are the key bottleneck for performance and scaling. In this work, we first propose target-independent optimizations for the SpMV operations of LiFE decomposed using the STD technique, followed by target-dependent optimizations for CPU and GPU systems. The target-independent techniques include: (1) standard compiler optimizations to prevent unnecessary and redundant computations, (2) data restructuring techniques to minimize the effects of indirect array accesses, and (3) methods to partition computations among threads to obtain coarse-grained parallelism with low synchronization overhead. We then present target-dependent optimizations for CPUs: (1) efficient synchronization-free thread mapping and (2) utilizing BLAS calls to exploit hardware-optimized library routines. Following that, we present various GPU-specific optimizations to optimally map threads at the granularity of warps, thread blocks, and the grid. Furthermore, to automate the CPU-based optimizations developed for this algorithm, we extend the PolyMage domain-specific language, embedded in Python. Our highly optimized and parallelized CPU implementation obtains a speedup of 6.3× over the naive parallel CPU implementation running on a 16-core Intel Xeon Silver (Skylake-based) system. In addition, our optimized GPU implementation achieves a speedup of 5.2× over a reference optimized GPU implementation on NVIDIA's GeForce RTX 2080 Ti GPU, and a speedup of 9.7× over our highly optimized and parallelized CPU implementation.
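To make the bottleneck concrete, below is a minimal sketch in C of a generic sparse matrix-vector product over a coordinate-format (COO) matrix; the function name, array names, and COO layout are illustrative assumptions, not the paper's implementation.

```c
/* Minimal sketch, not the authors' code: a coordinate-format (COO)
 * SpMV kernel computing y += A * x. The arrays row[], col[], val[]
 * and their layout are assumptions for illustration. Each iteration
 * performs two indirect references (col[k] to gather from x, row[k]
 * to scatter into y): the irregular access pattern that the
 * data-restructuring and thread-partitioning techniques described
 * in the abstract aim to mitigate. */
#include <stddef.h>

void spmv_coo(size_t nnz, const int *row, const int *col,
              const double *val, const double *x, double *y)
{
    for (size_t k = 0; k < nnz; ++k)
        y[row[k]] += val[k] * x[col[k]];
}
```

One common restructuring (a generic illustration, not necessarily the paper's exact scheme) is to sort the nonzeros by row so that each thread owns a disjoint range of rows; the scatter into y then needs no atomics or locks, which is one way to obtain coarse-grained parallelism with low synchronization overhead.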

Funder

Science and Engineering Research Board

Publisher

Association for Computing Machinery (ACM)

Subject

Computational Theory and Mathematics, Computer Science Applications, Hardware and Architecture, Modelling and Simulation, Software

Cited by 1 article.

1. Optimization Techniques for GPU Programming; ACM Computing Surveys; 2023-03-16
