Autotuning Convolutions Is Easier Than You Think

Published:2023-03 Issue:2 Volume:20 Page:1-24
ISSN:1544-3566
Container-title:ACM Transactions on Architecture and Code Optimization
language:en
Short-container-title:ACM Trans. Archit. Code Optim.

Author:

Tollenaere Nicolas¹, Iooss Guillaume¹, Pouget Stéphane², Brunie Hugo¹, Guillon Christophe¹, Cohen Albert³, Sadayappan P.⁴, Rastello Fabrice¹

Affiliation:

1. INRIA, Grenoble, France

2. University of California, Los Angeles, Los Angeles, California, USA

3. Google, Paris, France

4. University of Utah, Utah, USA

Abstract

A wide range of scientific and machine learning applications depend on highly optimized implementations of tensor computations. Exploiting the full capacity of a given processor architecture remains a challenging task, due to the complexity of the microarchitectural features that come into play when seeking near-peak performance. Among the state-of-the-art techniques for loop transformations for performance optimization, AutoScheduler [Zheng et al. 2020a] tends to outperform other systems. It often yields higher performance as compared to vendor libraries, but takes a large number of runs to converge, while also involving a complex training environment. In this article, we define a structured configuration space that enables much faster convergence to high-performance code versions, using only random sampling of candidates. We focus on two-dimensional convolutions on CPUs. Compared to state-of-the-art libraries, our structured search space enables higher performance for typical tensor shapes encountered in convolution stages in deep learning pipelines. Compared to auto-tuning code generators like AutoScheduler, it prunes the search space while increasing the density of efficient implementations. We analyze the impact on convergence speed and performance distribution, on two Intel x86 processors and one ARM AArch64 processor. We match or outperform the performance of the state-of-the-art oneDNN library and TVM’s AutoScheduler, while reducing the autotuning effort by at least an order of magnitude.

Funder

Bpifrance Programme d’Investissements d’Avenir (PIA) as part of the ES3CAP project

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3570641

Reference: 35 articles.

1. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). USENIX Association, 265–283.

2. Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman P. Amarasinghe. 2019. Tiramisu: A polyhedral compiler for expressing fast and portable code. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO’19), Mahmut Taylan Kandemir, Alexandra Jimborean, and Tipp Moseley (Eds.). IEEE, 193–205.

3. A Polynomial Time Algorithm for Counting Integral Points in Polyhedra When the Dimension is Fixed

4. Optimization space pruning without regrets

Cited by 10 articles.

1. DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration;ACM Transactions on Architecture and Code Optimization;2024-09-02

2. Improving Direct Convolution through Tensor Slicing, Vectorized Packing and ISA Extensions;Anais do XXXVII Concurso de Teses e Dissertações (CTD 2024);2024-07-21

3. The Droplet Search Algorithm for Kernel Scheduling;ACM Transactions on Architecture and Code Optimization;2024-05-21

4. A Predictable SIMD Library for GEMM Routines;2024 IEEE 30th Real-Time and Embedded Technology and Applications Symposium (RTAS);2024-05-13

5. Register Blocking: An Analytical Modelling Approach for Affine Loop Kernels;Proceedings of the 21st ACM International Conference on Computing Frontiers;2024-05-07