1. LINDHOLM E, NICKOLLS J, OBERMAN S, et al. NVIDIA Tesla: A unified graphics and computing architecture [J]. IEEE Micro, 2008, 28(2): 39–55.
2. DAI H, LIN Z, LI C, et al. Accelerate GPU concurrent kernel execution by mitigating memory pipeline stalls [C]//Proceedings of the 24th International Symposium on High Performance Computer Architecture (HPCA). Piscataway, NJ, USA: IEEE, 2018: 208–220.
3. KIM K, RO W W. WIR: Warp instruction reuse to minimize repeated computations in GPUs [C]//IEEE International Symposium on High Performance Computer Architecture (HPCA). Piscataway, NJ, USA: IEEE, 2018: 389–402.
4. ABBASITABAR H, SAMAVATIAN M H, SARBAZI-AZAD H. ASHA: An adaptive shared-memory sharing architecture for multi-programmed GPUs [J]. Microprocessors and Microsystems, 2016, 46: 264–273.
5. OH B, KIM N S, AHN J, et al. A load balancing technique for memory channels [C]//Proceedings of the International Symposium on Memory Systems (MEMSYS). New York, NY, USA: ACM, 2018: 55–66.