Affiliation:
1. University of Illinois at Urbana-Champaign, Urbana, Illinois 61801-2932, USA
Abstract
Linear algebra algorithms based on the BLAS or extended BLAS do not achieve high performance on multivector processors with a hierarchical memory system because of a lack of data locality. For such machines, block linear algebra algorithms must be implemented in terms of matrix-matrix primitives (BLAS3). Designing efficient linear algebra algorithms for these architectures requires analysis of the behavior of the matrix-matrix primitives and the resulting block algorithms as a function of certain system parameters. The analysis must identify the limits of performance improvement possible via blocking and any contradictory trends that require trade-off consideration. We propose a methodology that facilitates such an analysis and use it to analyze the performance of the BLAS3 primitives used in block methods. A similar analysis of the block size-performance relationship is also performed at the algorithm level for block versions of the LU decomposition and the Gram-Schmidt orthogonalization procedures.
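As a rough illustration of the blocking idea the abstract refers to, the sketch below contrasts a plain triple-loop matrix product with a blocked (tiled) version in which each tile of the operands is reused while it would be resident in fast memory. This is not the paper's implementation: the function names `matmul_naive` and `matmul_blocked` and the `block` parameter are hypothetical, and pure Python will not show the performance effect, only the loop structure and the role of the block size.

```python
def matmul_naive(A, B):
    """Reference product C = A * B using the plain triple loop."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C


def matmul_blocked(A, B, block=2):
    """Blocked product: the three outer loops walk over tiles of size
    `block`, and the three inner loops perform a small matrix-matrix
    update on one tile triple. `block` is the tunable parameter whose
    effect on performance a block-size analysis would study."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                # Update tile C[i0:, j0:] with tile A[i0:, k0:] * B[k0:, j0:].
                for i in range(i0, min(i0 + block, n)):
                    row_c = C[i]
                    row_a = A[i]
                    for k in range(k0, min(k0 + block, m)):
                        a = row_a[k]
                        row_b = B[k]
                        for j in range(j0, min(j0 + block, p)):
                            row_c[j] += a * row_b[j]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(matmul_blocked(A, B, block=2))
```

Both routines perform the same arithmetic; only the traversal order differs. On a machine with a memory hierarchy, the block size would typically be chosen so that the tiles in use fit in the fast level of memory, which is the kind of trade-off the abstract describes.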
Cited by
55 articles.
1. Block algorithms of a simultaneous difference solution of d’Alembert's and Maxwell's equations;Computer Optics;2018-07-24
2. General Linear Systems;Parallelism in Matrix Computations;2015-07-26
3. Fundamental Kernels;Parallelism in Matrix Computations;2015-07-26
4. Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion;ACM Transactions on Mathematical Software;2012-04
5. Cache Blocking;Applied Parallel and Scientific Computing;2012