1. Bakhoda A, Yuan G, Fung W, et al., 2009. Analyzing CUDA workloads using a detailed GPU simulator. ISPASS IEEE Int Symp on Performance Analysis of Systems and Software, p.163–174. https://doi.org/10.1109/ISPASS.2009.4919648
2. Che S, Boyer M, Meng J, et al., 2009. Rodinia: a benchmark suite for heterogeneous computing. IISWC IEEE Int Symp on Workload Characterization, p.44–54. https://doi.org/10.1109/IISWC.2009.5306797
3. Chen J, Tao X, Yang Z, et al., 2013. Guided region-based GPU scheduling: utilizing multi-thread parallelism to hide memory latency. IEEE 27th Int Symp on Parallel & Distributed Processing, p.441–451. https://doi.org/10.1109/IPDPS.2013.95
4. Chen X, Chang L, Rodrigues C, et al., 2014. Adaptive cache management for energy-efficient GPU computing. Proc 47th Annual IEEE/ACM Int Symp on Microarchitecture, p.343–355. https://doi.org/10.1109/MICRO.2014.11
5. Dally W, Labonte F, Das A, et al., 2003. Merrimac: supercomputing with streams. Proc ACM/IEEE Conf on Supercomputing, Article 35. https://doi.org/10.1145/1048935.1050187