Affiliation:
1. Imperial College London, United Kingdom
2. ETH Zürich, Switzerland
Abstract
There has been a recent trend towards using high-level synthesis (HLS) tools to generate dynamically scheduled hardware. The generated hardware is made up of components connected by handshake signals, which schedule the components at runtime as their inputs become available. Such approaches promise superior performance on “irregular” source programs, such as those whose control flow depends on input data, at the cost of additional area. Current dynamic scheduling techniques are well able to exploit parallelism among instructions within each basic block (BB) of the source program, but parallelism between BBs is under-explored because of the complexity of runtime control flow and memory dependencies. Existing tools allow some of the operations of different BBs to overlap, but to simplify the analysis required at compile time they require the BBs to start in strict program order, thus limiting the achievable parallelism and overall performance.
We formulate a general dependency model suitable for comparing the ability of different dynamic scheduling approaches to extract maximal parallelism at runtime. Using this model, we explore a variety of mechanisms for runtime scheduling, incorporating and generalising existing approaches. In particular, we precisely identify the restrictions in existing scheduling implementations and define possible optimisation solutions. We identify two particularly promising examples where the compile-time overhead is small and the area overhead is minimal, and yet we are able to significantly reduce execution time: (1) parallelising consecutive independent loops; and (2) parallelising independent inner-loop instances in a nested loop as individual threads. Using benchmark sets from related works, we compare our proposed toolflow against a state-of-the-art dynamic-scheduling HLS tool called Dynamatic. Our results show that, on average, our toolflow yields a 4× speedup from (1) and a 2.9× speedup from (2), with a negligible area overhead. This increases to a 14.3× average speedup when combining (1) and (2).
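As an illustration of the two program shapes named in (1) and (2), the following is a minimal C sketch of the kind of source code an HLS tool might receive. It is not taken from the paper: the function names `consecutive_loops` and `nested_loop`, the array sizes `N` and `M`, and the loop bodies are all hypothetical, chosen only so that the independence between loops (or between inner-loop instances) is easy to see.

```c
#include <stddef.h>

#define N 4 /* hypothetical outer size */
#define M 3 /* hypothetical inner size */

/* Pattern (1): two consecutive loops that touch disjoint arrays.
 * The second loop reads only c and writes only d, so it does not
 * depend on the first loop and need not wait for it to finish. */
void consecutive_loops(const int a[N], int b[N], const int c[N], int d[N]) {
    for (size_t i = 0; i < N; i++) {
        /* Data-dependent control flow: an "irregular" loop body. */
        b[i] = (a[i] > 0) ? a[i] * 2 : 0;
    }
    for (size_t j = 0; j < N; j++) {
        d[j] = c[j] + 1;
    }
}

/* Pattern (2): a nested loop in which inner-loop instance i reads
 * only row i of x and writes only sum[i], so different instances
 * are independent and could in principle run as separate threads. */
void nested_loop(int x[N][M], int sum[N]) {
    for (size_t i = 0; i < N; i++) {
        int acc = 0;
        for (size_t j = 0; j < M; j++) {
            if (x[i][j] > 0) {
                acc += x[i][j];
            }
        }
        sum[i] = acc;
    }
}
```

In sequential C the loops simply run in program order. The abstract's point is that a scheduler requiring basic blocks to start in strict program order cannot overlap work like this, even though nothing in the data dependencies forbids it.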
Publisher
Association for Computing Machinery (ACM)
Cited by
1 article.
1. Survival of the Fastest: Enabling More Out-of-Order Execution in Dataflow Circuits;Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays;2024-04