The load slice core microarchitecture

Author:

Carlson Trevor E. [1], Heirman Wim [2], Allam Osman [3], Kaxiras Stefanos [1], Eeckhout Lieven [3]

Affiliation:

1. Uppsala University, Sweden

2. Intel, ExaScience Lab

3. Ghent University, Belgium

Abstract

Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism (MHP), while reducing the net impact of more general, yet complex and power-hungry ILP-extraction techniques. In addition, for multi-core processors operating in power- and energy-constrained environments, energy-efficiency has largely replaced single-thread performance as the primary concern. Based on this observation, we propose a core microarchitecture that is aimed squarely at generating parallel accesses to the memory hierarchy while maximizing energy efficiency. The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline. Backward program slices containing address-generating instructions leading up to loads and stores are extracted automatically by the hardware, using a novel iterative algorithm that requires no software support or recompilation. On average, the Load Slice Core improves performance over a baseline in-order processor by 53% with overheads of only 15% in area and 22% in power, leading to an increase in energy efficiency (MIPS/Watt) over in-order and out-of-order designs by 43% and over 4.7×, respectively. In addition, for a power- and area-constrained many-core design, the Load Slice Core outperforms both in-order and out-of-order designs, achieving a 53% and 95% higher performance, respectively, thus providing an alternative direction for future many-core processors.
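The abstract says only that backward slices are found "iteratively" by the hardware. As a rough illustration of what that could mean, the Python sketch below models one plausible reading: each pass over a loop body marks the most recent producer of every register that feeds a load/store address or an already-marked instruction, so each pass uncovers one more level of the backward slice. All names here (`Instr`, `mark_one_pass`, `extract_slices`, `LOOP_BODY`) are hypothetical and chosen for illustration; this is a software model under stated assumptions, not the hardware structures or the exact algorithm of the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    pc: int
    op: str                  # "load", "store", or "alu"
    dst: Optional[str]       # destination register, if any
    srcs: Tuple[str, ...]    # assumption: for loads/stores, address registers only


def mark_one_pass(body: List[Instr], slice_pcs: Set[int],
                  last_writer: Dict[str, int]) -> int:
    """Walk the loop body once in program order.

    For every memory access or already-marked instruction, mark the most
    recent writer of each source register. Returns how many PCs were
    newly marked in this pass.
    """
    added = 0
    for ins in body:
        if ins.op in ("load", "store") or ins.pc in slice_pcs:
            for reg in ins.srcs:
                producer = last_writer.get(reg)
                if producer is not None and producer not in slice_pcs:
                    slice_pcs.add(producer)
                    added += 1
        if ins.dst is not None:
            last_writer[ins.dst] = ins.pc  # persists across passes (loop iterations)
    return added


def extract_slices(body: List[Instr], max_passes: int = 16) -> Tuple[Set[int], int]:
    """Repeat passes until nothing new is marked; return (marked PCs, productive passes)."""
    slice_pcs: Set[int] = set()
    last_writer: Dict[str, int] = {}
    for n in range(1, max_passes + 1):
        if mark_one_pass(body, slice_pcs, last_writer) == 0:
            return slice_pcs, n - 1
    return slice_pcs, max_passes


# Hypothetical loop body: sum += a[i]; i++
LOOP_BODY = [
    Instr(0, "alu",  "r2", ("r1",)),       # r2 = i * 8
    Instr(1, "alu",  "r3", ("r2", "r4")),  # r3 = base + r2
    Instr(2, "load", "r5", ("r3",)),       # r5 = mem[r3]
    Instr(3, "alu",  "r6", ("r5", "r6")),  # sum += r5 (not address-generating)
    Instr(4, "alu",  "r1", ("r1",)),       # i++
]

if __name__ == "__main__":
    pcs, passes = extract_slices(LOOP_BODY)
    print(sorted(pcs), passes)  # prints [0, 1, 4] 3
```

In this toy loop the address chain is three instructions deep (PC 1, then PC 0, then PC 4), so three passes are needed before the marked set stops growing, while the accumulation at PC 3 is never marked and would stay in the main pipeline.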

Publisher

Association for Computing Machinery (ACM)

Cited by 2 articles.

1. Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches;MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture;2021-10-17

2. Criticality Driven Fetch;MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture;2021-10-17
