1. Huang, H., Wang, Z., Zhang, J., He, Z., Wu, C., Xiao, J., Alonso, G.: Shuhai: a tool for benchmarking high bandwidth memory on FPGAs. IEEE Trans. Comput. (2022)
2. Jia, Z., Maggioni, M., Staiger, B., Scarpazza, D.P.: Dissecting the NVIDIA Volta GPU architecture via microbenchmarking (2018). arXiv:1804.06826
3. Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12 (2017)
4. Jouppi, N.P., Kurian, G., Li, S., Ma, P., Nagarajan, R., Nai, L., Patil, N., Subramanian, S., Swing, A., Towles, B., et al.: TPU v4: an optically reconfigurable supercomputer for machine learning with hardware support for embeddings (2023). arXiv:2304.01433
5. Kumar, S., Bitorff, V., Chen, D., Chou, C., Hechtman, B., Lee, H., Kumar, N., Mattson, P., Wang, S., Wang, T., et al.: Scale MLPerf-0.6 models on Google TPU-v3 pods (2019). arXiv:1909.09756