Affiliation:
1. Eindhoven University of Technology, Eindhoven, The Netherlands
Abstract
The shift toward parallel processor architectures has made programming and code generation increasingly challenging. To address this programmability challenge, this article presents a technique to fully automatically generate efficient and readable code for parallel processors (with a focus on GPUs). This is made possible by combining algorithmic skeletons, traditional compilation, and “algorithmic species,” a classification of program code. Compilation starts by automatically annotating C code with class information (the algorithmic species). This code is then fed into the skeleton-based source-to-source compiler bones to generate CUDA code. To generate efficient code, bones also performs optimizations including host-accelerator transfer optimization and kernel fusion. This results in a unique approach, integrating a skeleton-based compiler for the first time into an automated flow. The benefits are demonstrated experimentally for PolyBench GPU kernels, showing geometric mean speed-ups of 1.4× and 2.4× compared to ppcg and Par4All, and for five Rodinia GPU benchmarks, showing a gap of only 1.2× compared to hand-optimized code.
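To make the kernel fusion mentioned in the abstract concrete, the following is a minimal sketch in plain C, not output of bones and not taken from the article. It assumes two consecutive element-wise loops (standing in for two GPU kernels) that communicate through an intermediate array; the function names and the computation are hypothetical. Fusing the loops removes the intermediate array and one pass over memory while producing the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused version: two separate loops, each standing in for one kernel.
 * The intermediate array tmp is written by the first loop and read by
 * the second. (Hypothetical example, not generated by bones.) */
void scale_then_add_unfused(size_t n, const float *a, const float *b,
                            float *tmp, float *out) {
    assert(a && b && tmp && out);
    for (size_t i = 0; i < n; i++) {
        tmp[i] = 2.0f * a[i];       /* "kernel" 1: scale */
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = tmp[i] + b[i];     /* "kernel" 2: add */
    }
}

/* Fused version: both element-wise steps run in a single loop, so the
 * intermediate array and the second pass over memory disappear. */
void scale_then_add_fused(size_t n, const float *a, const float *b,
                          float *out) {
    assert(a && b && out);
    for (size_t i = 0; i < n; i++) {
        out[i] = 2.0f * a[i] + b[i];
    }
}
```

Fusion of this kind is only valid when iteration i of the second loop depends solely on iteration i of the first, which is the sort of information a classification of the code's access patterns can expose to a compiler.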
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Information Systems, Software
Cited by
17 articles.