High Performance Parallel LOBPCG Method for Large Hamiltonian Derived from Hubbard Model on Multi-GPU Systems

Published: 2022
Pages: 1-19
ISSN: 0302-9743
Container title: Supercomputing Frontiers

Author:

Yamada Susumu, Imamura Toshiyuki, Machida Masahiko

Abstract

The physical properties of the Hubbard model can be understood by solving the eigenvalue problem for the Hamiltonian derived from the model. Since the Hamiltonian is a large sparse matrix, an iterative method is usually used to solve the problem. One effective solver is the LOBPCG (Locally Optimal Block Preconditioned Conjugate Gradient) method. Tuning strategies for the method on GPU systems have already been proposed for the case where all iteration vectors are stored in device memory. In this research, we propose tuning strategies for the parallel LOBPCG method on multi-GPU systems for the case where the Hamiltonian is large and some iteration vectors are stored in host memory. When the LOBPCG method is used to compute multiple eigenpairs (eigenvalues and the corresponding eigenvectors), the number of iteration vectors, each of which has the same size as the dimension of the Hamiltonian, is proportional to the number of eigenpairs. On the other hand, the memory consumed by the non-zero elements of the Hamiltonian can be significantly reduced by exploiting the regular arrangement of those elements. Therefore, when we execute the LOBPCG method for a large Hamiltonian on GPUs, some of the vectors have to be stored in host memory and transferred between host and device memory as needed. Since the cost of this data transfer is very large, we also propose an optimization for it. Simulation results on a multi-GPU system show that optimizing the data transfer is very effective for high-performance computing.
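
The memory argument in the abstract can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not taken from the paper: the function names are hypothetical, the default of six iteration vectors per eigenpair is an assumption (a common LOBPCG formulation keeps the eigenvector, residual, and search-direction blocks plus their products with the Hamiltonian), and the storage for the Hamiltonian itself is ignored.

```python
def lobpcg_vector_bytes(dim, n_eigenpairs, vectors_per_pair=6, bytes_per_element=8):
    """Total bytes for all iteration vectors.

    vectors_per_pair=6 is an assumption, not a figure from the paper;
    bytes_per_element=8 corresponds to double precision.
    """
    return dim * n_eigenpairs * vectors_per_pair * bytes_per_element


def vectors_on_host(dim, n_eigenpairs, device_bytes,
                    vectors_per_pair=6, bytes_per_element=8):
    """Number of iteration vectors that do not fit in device memory.

    Assumes the whole device budget is available for vectors.
    """
    total_vectors = n_eigenpairs * vectors_per_pair
    bytes_per_vector = dim * bytes_per_element
    # Whole vectors that fit on the device, capped at the total count.
    fit_on_device = min(total_vectors, device_bytes // bytes_per_vector)
    return total_vectors - fit_on_device


if __name__ == "__main__":
    dim = 10**9              # hypothetical Hamiltonian dimension
    device = 40 * 2**30      # hypothetical 40 GiB of device memory
    print(lobpcg_vector_bytes(dim, 4))      # bytes for 4 eigenpairs
    print(vectors_on_host(dim, 4, device))  # vectors left in host memory
```

Under these assumptions the vector storage grows linearly with the number of eigenpairs, so for a large enough dimension most vectors end up in host memory, which is why the cost of host-device transfers matters.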

Publisher

Springer International Publishing

Link

https://link.springer.com/content/pdf/10.1007/978-3-031-10419-0_1

References (17 articles)

1. Anzt, H., Tomov, S., Dongarra, J.: Accelerating the LOBPCG method on GPUs using a blocked sparse matrix vector product. In: Proceedings of the Symposium on High Performance Computing, pp. 75–82 (2015)

2. Demmel, J., Grigori, L., Hoemmen, M., Langou, J.: Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput. 34, A206–A239 (2012). https://doi.org/10.1137/080731992

3. Duersch, J.A., Gu, M., Shao, M., Yang, C.: A robust and efficient implementation of LOBPCG. SIAM J. Sci. Comput. 40, C655–C676 (2018). https://doi.org/10.1137/17M1129830

4. Fukaya, T., Nakatsukasa, Y., Yanagisawa, Y., Yamamoto, Y.: CholeskyQR2: a simple and communication-avoiding algorithm for computing a Tall-Skinny QR factorization on a large-scale parallel system. In: ScalA 2014 (2014)

5. Hetmaniuk, U., Lehoucq, R.: Basis selection in LOBPCG. J. Comput. Phys. 218, 324–332 (2006)