TCX: A RISC Style Tensor Computing Extension and a Programmable Tensor Processor

Published:2023-04-19 Issue:3 Volume:22 Page:1-27
ISSN:1539-9087
Container-title:ACM Transactions on Embedded Computing Systems
language:en
Short-container-title:ACM Trans. Embed. Comput. Syst.

Author:

Liang Tailin¹, Wang Lei², Shi Shaobo¹, Glossner John³, Zhang Xiaotong²

Affiliation:

1. University of Science and Technology Beijing, China and Hua Xia General Processor Technologies, Haidian Qu, Beijing, China

2. University of Science and Technology Beijing, Beijing, China

3. University of Science and Technology Beijing, China and General Processor Technologies, New York, USA

Abstract

Neural network processors and accelerators are domain-specific architectures deployed to meet the high computational requirements of deep learning algorithms. This article proposes a new instruction set extension for tensor computing, TCX, using Reduced Instruction Set Computer (RISC) instructions enhanced with variable-length tensor extensions. It features a multi-dimensional register file, dimension registers, and fully generic tensor instructions. It can be seamlessly integrated into existing RISC Instruction Set Architectures and provides software compatibility for scalable hardware implementations. We present a tensor accelerator implementation of the tensor extensions using an out-of-order RISC microarchitecture. The tensor accelerator is scalable in computation units from several hundred to tens of thousands. An optimized register renaming mechanism is described that allows for many physical tensor registers without requiring architectural support for large tensor register names. We describe new tensor load and store instructions that reduce bandwidth requirements using tensor dimension registers. Implementations may balance data bandwidth and computation utilization for different types of tensor computations such as element-wise, depthwise, and matrix-multiplication. We characterize the computation precision of tensor operations to balance area, generality, and accuracy loss for several well-known neural networks. The TCX processor runs at 1 GHz and sustains 8.2 tera-operations per second (TOPS) using a 4,096 multiply-accumulate compute unit. It occupies 12.8 mm² while dissipating 0.46 W/TOPS in TSMC 28-nm technology.
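
The headline figures in the abstract can be cross-checked with simple arithmetic. The sketch below is only a back-of-the-envelope check, not a calculation taken from the article. It assumes that each multiply-accumulate counts as two operations (one multiply, one add), a common convention that the abstract does not state. The function names are hypothetical, and the power and area-efficiency values it prints are derived estimates rather than reported results.

```python
# Back-of-the-envelope check of the abstract's throughput figure.
# Assumption (not stated in the abstract): one MAC = 2 operations.

def peak_tops(mac_units: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Throughput in tera-operations per second if every MAC unit is busy each cycle."""
    return mac_units * clock_hz * ops_per_mac / 1e12


def implied_power_watts(tops: float, watts_per_tops: float) -> float:
    """Power implied by an energy-efficiency figure given in W/TOPS."""
    return tops * watts_per_tops


def area_efficiency(tops: float, area_mm2: float) -> float:
    """Throughput per unit area in TOPS/mm^2."""
    return tops / area_mm2


# Figures quoted in the abstract: 4,096 MACs, 1 GHz, 0.46 W/TOPS, 12.8 mm^2.
tops = peak_tops(mac_units=4096, clock_hz=1e9)
print(f"throughput:      {tops:.3f} TOPS")
print(f"implied power:   {implied_power_watts(tops, 0.46):.2f} W")
print(f"area efficiency: {area_efficiency(tops, 12.8):.2f} TOPS/mm^2")
```

Under that assumption, 4,096 MACs at 1 GHz give 8.192 TOPS, which rounds to the 8.2 TOPS quoted in the abstract.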

Funder

National Natural Science Foundation of China

Scientific and Technological Innovation Foundation of Shunde Graduate School, USTB

Interdisciplinary research project of USTB

Fundamental Research Funds for the Central Universities

Foshan Higher Education Foundation

MAGICOM Platform of Beijing Advanced Innovation Center for Materials Genome Engineering

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Software