In-Datacenter Performance Analysis of a Tensor Processing Unit

Published:2017-09-14 Issue:2 Volume:45 Page:1-12
ISSN:0163-5964
Container-title:ACM SIGARCH Computer Architecture News
language:en
Short-container-title:SIGARCH Comput. Archit. News

Author:

Jouppi Norman P.¹,Young Cliff¹,Patil Nishant¹,Patterson David¹,Agrawal Gaurav¹,Bajwa Raminder¹,Bates Sarah¹,Bhatia Suresh¹,Boden Nan¹,Borchers Al¹,Boyle Rick¹,Cantin Pierre-luc¹,Chao Clifford¹,Clark Chris¹,Coriell Jeremy¹,Daley Mike¹,Dau Matt¹,Dean Jeffrey¹,Gelb Ben¹,Ghaemmaghami Tara Vazir¹,Gottipati Rajendra¹,Gulland William¹,Hagmann Robert¹,Ho C. Richard¹,Hogberg Doug¹,Hu John¹,Hundt Robert¹,Hurt Dan¹,Ibarz Julian¹,Jaffey Aaron¹,Jaworski Alek¹,Kaplan Alexander¹,Khaitan Harshit¹,Killebrew Daniel¹,Koch Andy¹,Kumar Naveen¹,Lacy Steve¹,Laudon James¹,Law James¹,Le Diemthu¹,Leary Chris¹,Liu Zhuyuan¹,Lucke Kyle¹,Lundin Alan¹,MacKean Gordon¹,Maggiore Adriana¹,Mahony Maire¹,Miller Kieran¹,Nagarajan Rahul¹,Narayanaswami Ravi¹,Ni Ray¹,Nix Kathy¹,Norrie Thomas¹,Omernick Mark¹,Penukonda Narayana¹,Phelps Andy¹,Ross Jonathan¹,Ross Matt¹,Salek Amir¹,Samadiani Emad¹,Severn Chris¹,Sizikov Gregory¹,Snelham Matthew¹,Souter Jed¹,Steinberg Dan¹,Swing Andy¹,Tan Mercedes¹,Thorson Gregory¹,Tian Bo¹,Toma Horia¹,Tuttle Erick¹,Vasudevan Vijay¹,Walter Richard¹,Wang Walter¹,Wilcox Eric¹,Yoon Doe Hyun¹

Affiliation:

1. Google, Inc., Mountain View, CA USA

Abstract

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X to 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X to 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
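
The 92 TOPS peak figure in the abstract can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the MAC count comes from the abstract, while the 700 MHz clock rate is an assumption taken from the full paper rather than from this record, and the function name `peak_tops` is hypothetical.

```python
# Back-of-the-envelope check of the peak throughput quoted in the abstract.
MAC_COUNT = 65_536   # from the abstract: 8-bit MACs (equivalent to a 256 x 256 array)
OPS_PER_MAC = 2      # each multiply-accumulate counts as one multiply plus one add
CLOCK_HZ = 700e6     # ASSUMPTION: 700 MHz clock, stated in the full paper, not in this abstract


def peak_tops(mac_count: int, ops_per_mac: int, clock_hz: float) -> float:
    """Return peak throughput in tera-operations per second."""
    return mac_count * ops_per_mac * clock_hz / 1e12


if __name__ == "__main__":
    # Rounds to the 92 TOPS figure given in the abstract.
    print(f"{peak_tops(MAC_COUNT, OPS_PER_MAC, CLOCK_HZ):.2f} TOPS")
```

Under these assumptions the result is about 91.75 TOPS, consistent with the rounded peak of 92 TOPS. This is a peak number; the abstract notes that achieved utilization is low for some applications.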

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3140659.3080246

References: 61 articles.

1. Abadi M. Agarwal A. Barham P. Brevdo E. Chen Z. Citro C. Corrado G.S. Davis A. Dean J. Devin M. Ghemawat S. et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467.

2. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing

3. Adolf R. Rama S. Reagen B. Wei G.Y. and Brooks D. 2016 September. Fathom: reference workloads for modern deep learning methods. IEEE Int'l Symp. on Workload Characterization (IISWC).

Cited by 211 articles.

1. Ternary Content-Addressable Memory Based on a Single Two-Dimensional Transistor for Memory-Augmented Learning;ACS Nano;2024-08-13

2. Parallel GEMM-based convolutions for deep learning on multicore ARM and RISC-V architectures;Journal of Systems Architecture;2024-08

3. The Role of Field-Programmable Gate Arrays in the Acceleration of Modern High-Performance Computing Workloads;Computer;2024-07

4. Utilizing Dual-Port FeFETs for Energy-Efficient Binary Neural Network Inference Accelerators;IEEE Transactions on Electron Devices;2024-07

5. MECLA: Memory-Compute-Efficient LLM Accelerator with Scaling Sub-matrix Partition;2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA);2024-06-29