MInGLE

Published: 2016-06-27 Issue: 2 Volume: 13 Page: 1-26
ISSN: 1544-3566
Container-title: ACM Transactions on Architecture and Code Optimization
Language: en
Short-container-title: ACM Trans. Archit. Code Optim.

Author:

González-Álvarez Cecilia¹, Sartor Jennifer B.², Álvarez Carlos³, Jiménez-González Daniel³, Eeckhout Lieven⁴

Affiliation:

1. Ghent University & Universitat Politècnica de Catalunya

2. Ghent University & Vrije Universiteit Brussel

3. Universitat Politècnica de Catalunya, Barcelona, Spain

4. Ghent University, Zwijnaarde, Belgium

Abstract

The end of Dennard scaling leads to new research directions that try to cope with the utilization wall in modern chips, such as the design of specialized architectures. Processor customization utilizes transistors more efficiently, optimizing not only for performance but also for power. However, hardware specialization for each application is costly and impractical due to time-to-market constraints. Domain-specific specialization is an alternative that can increase hardware reutilization across applications that share similar computations. This article explores the specialization of low-power processors with custom instructions (CIs) that run on a specialized functional unit. We are the first, to our knowledge, to design CIs for an application domain and across basic blocks, selecting CIs that maximize both performance and energy efficiency improvements. We present the Merged Instructions Generator for Large Efficiency (MInGLE), an automated framework that identifies and selects CIs. Our framework analyzes large sequences of code (across basic blocks) to maximize acceleration potential while also performing partial matching across applications to optimize for reuse of the specialized hardware. To do this, we convert the code into a new canonical representation, the Merging Diagram, which represents the code’s functionality instead of its structure. This is key to being able to find similarities across such large code sequences from different applications with different coding styles. Groups of potential CIs are clustered depending on their similarity score to effectively reduce the search space. Additionally, we create new CIs that cover not only whole-body loops but also fragments of the code to optimize hardware reutilization further. For a set of 11 applications from the media domain, our framework generates CIs that significantly improve the energy-delay product (EDP) and performance speedup. CIs with the highest utilization opportunities achieve an average EDP improvement of 3.8× compared to a baseline processor modeled after an Intel Atom. We demonstrate that we can efficiently accelerate a domain with partially matched CIs, and that their design time, from identification to selection, stays within tractable bounds.
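
For readers unfamiliar with the metric, the energy-delay product is energy consumed multiplied by execution time, and an EDP improvement is the ratio of the baseline EDP to the EDP of the customized design. The sketch below illustrates only that arithmetic. The function names and all numeric values are hypothetical and are not taken from the article.

```python
def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product: energy multiplied by execution time."""
    return energy_joules * delay_seconds


def edp_improvement(base_energy: float, base_delay: float,
                    new_energy: float, new_delay: float) -> float:
    """Ratio of baseline EDP to new EDP; values above 1 mean an improvement."""
    return edp(base_energy, base_delay) / edp(new_energy, new_delay)


# Hypothetical measurements, chosen only to keep the arithmetic easy to follow.
baseline = (4.0, 2.0)     # 4 J consumed over 2 s
customized = (2.0, 1.0)   # 2 J consumed over 1 s

ratio = edp_improvement(*baseline, *customized)
print(f"EDP improvement: {ratio:.1f}x")
```

Because EDP multiplies the two quantities, halving both energy and delay in this made-up case improves EDP by a factor of four, not two.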

Funder

European Research Council under the European Community's Seventh Framework Programme

ERC

Spanish Ministry of Science and Technology

Generalitat de Catalunya

Spanish Government under the Severo Ochoa program

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/2898356