Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

Author:

Jack L. Lo (1), Joel S. Emer (2), Henry M. Levy (1), Rebecca L. Stamm (2), Dean M. Tullsen (3), S. J. Eggers

Affiliation:

1. Univ. of Washington, Seattle

2. Digital Equipment Corporation, Hudson, MA

3. Univ. of California, San Diego

Abstract

To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel processing styles statically partition processor resources, thus preventing them from adapting to dynamically changing levels of ILP and TLP in a program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue hardware on a superscalar is wasted. This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By permitting multiple threads to share the processor's functional units simultaneously, the processor can use both ILP and TLP to accommodate variations in parallelism. When a program has only a single thread, all of the SMT processor's resources can be dedicated to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP. We examine two alternative on-chip parallel architectures for the next generation of processors. We compare SMT and small-scale, on-chip multiprocessors in their ability to exploit both ILP and TLP. First, we identify the hardware bottlenecks that prevent multiprocessors from effectively exploiting ILP. Then, we show that because of its dynamic resource sharing, SMT avoids these inefficiencies and benefits from being able to run more threads on a single processor.
The use of TLP is especially advantageous when per-thread ILP is limited. The ease of adding thread contexts on an SMT (relative to adding processors on an MP) allows simultaneous multithreading to expose more parallelism, further increasing functional unit utilization and attaining a 52% average speedup (versus a four-processor, single-chip multiprocessor with comparable execution resources). This study also addresses an often-cited concern regarding the use of thread-level parallelism or multithreading: interference in the memory system and branch prediction hardware. We find that multiple threads cause interthread interference in the caches and place greater demands on the memory system, thus increasing average memory latencies. By exploiting thread-level parallelism, however, SMT hides these additional latencies, so that they have only a small impact on total program performance. We also find that for parallel applications, the additional threads have minimal effects on branch prediction.
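The contrast between static partitioning and dynamic sharing of issue slots can be illustrated with a toy model. The sketch below is not the simulator or methodology used in the article: the function names, the four-thread trace, and the issue widths (four 2-wide cores versus one 8-wide SMT core) are all assumptions chosen for illustration, and the model ignores fetch limits, functional-unit types, caches, and branch prediction. It only counts how many ready instructions can issue per cycle under each policy.

```python
# Toy issue-slot model (illustrative assumption, not the article's simulator).
# Each cycle is a list giving the number of ready instructions per thread.

def mp_issued(ready_per_thread, width_per_core):
    """Static partitioning: each thread is limited to its own core's issue slots."""
    return sum(min(ready, width_per_core) for ready in ready_per_thread)

def smt_issued(ready_per_thread, total_width):
    """Dynamic sharing: all threads compete for one shared pool of issue slots."""
    return min(sum(ready_per_thread), total_width)

def utilization(trace, issue_fn, width_arg, total_width):
    """Fraction of all issue slots that were filled over the whole trace."""
    issued = sum(issue_fn(cycle, width_arg) for cycle in trace)
    return issued / (total_width * len(trace))

# Hypothetical trace: ready instructions per thread, per cycle.
TRACE = [
    [8, 0, 0, 0],  # one thread with high ILP, no TLP
    [1, 1, 1, 1],  # four threads, each with low ILP
    [4, 0, 2, 0],  # uneven mix
    [3, 3, 1, 1],  # uneven mix
]

TOTAL_WIDTH = 8      # assumed: same total issue slots for both designs
CORES = 4            # assumed: four-processor MP
WIDTH_PER_CORE = TOTAL_WIDTH // CORES

mp_util = utilization(TRACE, mp_issued, WIDTH_PER_CORE, TOTAL_WIDTH)
smt_util = utilization(TRACE, smt_issued, TOTAL_WIDTH, TOTAL_WIDTH)

if __name__ == "__main__":
    print(f"MP  issue-slot utilization: {mp_util:.4f}")
    print(f"SMT issue-slot utilization: {smt_util:.4f}")
```

In this model the shared pool can never issue fewer instructions than the partitioned one when total width is equal, since each thread's unused slots remain available to the others. The size of the gap depends entirely on the made-up trace and says nothing about the 52% speedup reported in the abstract.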

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science


Cited by 80 articles.

1. ScaleCache: A Scalable Page Cache for Multiple Solid-State Drives;Proceedings of the Nineteenth European Conference on Computer Systems;2024-04-22

2. Performance Tuning via Lean Measurements for Acceleration of Network Functions Virtualization;IEEE/ACM Transactions on Networking;2023-02

3. Coherency Traffic Reduction in Manycore Systems;2022 25th Euromicro Conference on Digital System Design (DSD);2022-08

4. Exploit the data level parallelism and schedule dependent tasks on the multi-core processors;Information Sciences;2022-03

5. PARMA: Parallelization-Aware Run-Time Management for Energy-Efficient Many-Core Systems;IEEE Transactions on Computers;2020-10-01
