Affiliation:
1. Uppsala University, Sweden
2. Intel, ExaScience Lab
3. Ghent University, Belgium
Abstract
Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism (MHP), while reducing the net impact of more general, yet complex and power-hungry ILP-extraction techniques. In addition, for multi-core processors operating in power- and energy-constrained environments, energy-efficiency has largely replaced single-thread performance as the primary concern.
Based on this observation, we propose a core microarchitecture that is aimed squarely at generating parallel accesses to the memory hierarchy while maximizing energy efficiency. The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline. Backward program slices containing address-generating instructions leading up to loads and stores are extracted automatically by the hardware, using a novel iterative algorithm that requires no software support or recompilation. On average, the Load Slice Core improves performance over a baseline in-order processor by 53% with overheads of only 15% in area and 22% in power, leading to an increase in energy efficiency (MIPS/Watt) over in-order and out-of-order designs by 43% and over 4.7×, respectively. In addition, for a power- and area-constrained many-core design, the Load Slice Core outperforms both in-order and out-of-order designs, achieving a 53% and 95% higher performance, respectively, thus providing an alternative direction for future many-core processors.
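The abstract says only that backward slices of address-generating instructions are found iteratively in hardware. The sketch below is one possible reading of that idea, not the authors' implementation: each pass over a loop body marks the direct producers of addresses already known to matter, so the slice grows one dependency level per pass. The `Instr` record, the register names, the `last_writer` map, and `extract_slices` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    pc: int                           # instruction address (hypothetical encoding)
    dest: Optional[str]               # register written, if any
    srcs: Tuple[str, ...]             # all source registers
    addr_srcs: Tuple[str, ...] = ()   # registers forming a memory address (loads/stores)
    is_mem: bool = False              # True for loads and stores


def run_pass(loop: List[Instr], slice_pcs: Set[int],
             last_writer: Dict[str, int]) -> int:
    """Execute the loop body once; return how many PCs were newly marked."""
    newly_marked = 0
    for ins in loop:
        if ins.is_mem:
            tracked = ins.addr_srcs   # only the address operands matter
        elif ins.pc in slice_pcs:
            tracked = ins.srcs        # marked instructions propagate backward
        else:
            tracked = ()
        for reg in tracked:
            producer = last_writer.get(reg)
            if producer is not None and producer not in slice_pcs:
                slice_pcs.add(producer)
                newly_marked += 1
        if ins.dest is not None:
            last_writer[ins.dest] = ins.pc   # remember who last wrote this register
    return newly_marked


def extract_slices(loop: List[Instr],
                   max_passes: int = 16) -> Tuple[Set[int], List[int]]:
    """Repeat passes until no new instruction is marked (or max_passes is hit)."""
    slice_pcs: Set[int] = set()
    last_writer: Dict[str, int] = {}
    marked_per_pass: List[int] = []
    for _ in range(max_passes):
        n = run_pass(loop, slice_pcs, last_writer)
        marked_per_pass.append(n)
        if n == 0:
            break
    return slice_pcs, marked_per_pass


# Toy loop: r1 += 4; r2 = r1 << 2; r3 = r0 + r2; r4 = load [r3]; r5 += r4
LOOP = [
    Instr(pc=0, dest="r1", srcs=("r1",)),
    Instr(pc=1, dest="r2", srcs=("r1",)),
    Instr(pc=2, dest="r3", srcs=("r0", "r2")),
    Instr(pc=3, dest="r4", srcs=("r3",), addr_srcs=("r3",), is_mem=True),
    Instr(pc=4, dest="r5", srcs=("r5", "r4")),
]

if __name__ == "__main__":
    pcs, per_pass = extract_slices(LOOP)
    print(sorted(pcs), per_pass)
```

In this toy loop the three instructions that compute the load address (PCs 0, 1 and 2) are marked after three passes, one per pass, while the accumulation at PC 4 consumes the loaded value and is never marked. Under this assumed scheme, those marked instructions are the ones that would be allowed to bypass stalled work in the main pipeline.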
Publisher
Association for Computing Machinery (ACM)
References
42 articles.
1. ARM, "2GHz capable Cortex-A9 dual core processor implementation," http://www.arm.com/files/downloads/Osprey_Analyst_Presentation_v2a.pdf, archived at the Internet Archive (http://archive.org).
2. ARM, "ARM Cortex-A7 processor," http://www.arm.com/products/processors/cortex-a/cortex-a7.php.
3. "Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense
Cited by
2 articles.
1. Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches;MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture;2021-10-17
2. Criticality Driven Fetch;MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture;2021-10-17