Affiliation:
1. University of Texas at San Antonio, TX
Abstract
In LAPACK many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high performance Level 3 BLAS. The Level 3 BLAS have excellent scaling, but panel processing tends to be bus bound, and thus scales with bus speed rather than the number of processors (p). Amdahl's law therefore ensures that as p grows, the panel computation will become the dominant cost of these LAPACK routines. Our contribution is a novel parallel cache assignment approach to the panel factorization which we show scales well with p. We apply this general approach to the QR, QL, RQ, LQ and LU panel factorizations. We show results for two commodity platforms: an 8-core Intel platform and a 32-core AMD platform. For both platforms and all twenty implementations (five factorizations each of which is available in 4 types), we present results that demonstrate that our approach yields significant speedup over the existing state of the art.
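The Amdahl's law argument in the abstract can be sketched as a short worked equation. This is a simplified model, not taken from the paper: the symbols T_panel (time spent in panel factorization, assumed not to improve with p) and T_update (time spent in Level 3 BLAS updates, assumed to scale ideally with p) are illustrative assumptions.

\[
T(p) = T_{\mathrm{panel}} + \frac{T_{\mathrm{update}}}{p},
\qquad
\frac{T_{\mathrm{panel}}}{T(p)} \to 1 \quad \text{as } p \to \infty,
\qquad
S(p) = \frac{T(1)}{T(p)} \le \frac{T_{\mathrm{panel}} + T_{\mathrm{update}}}{T_{\mathrm{panel}}}.
\]

Under these assumptions the panel term eventually dominates the runtime and caps the achievable speedup, which is why a panel factorization that itself scales with p matters.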
Funder
Office of Cyberinfrastructure
National Science Foundation
Division of Computing and Communication Foundations
Publisher
Association for Computing Machinery (ACM)
Subject
Applied Mathematics, Software
Cited by
7 articles.