Algorithm 942

Author:

de la Cruz Raúl (1), Araya-Polo Mauricio (2)

Affiliation:

1. Barcelona Supercomputing Center

2. Repsol USA

Abstract

Finite Difference (FD) is a widely used method to solve Partial Differential Equations (PDE). PDEs are the core of many simulations in different scientific fields, such as geophysics and astrophysics. The typical FD solver performs stencil computations for the entire computational domain, thus solving the differential operators. In general terms, the stencil computation consists of a weighted accumulation of the contribution of neighbor points along the Cartesian axes. Therefore, optimizing stencil computations is crucial in reducing the application execution time. Stencil computation performance is bounded by two main factors: the memory access pattern and the inefficient reuse of the accessed data. We propose a novel algorithm, named Semi-stencil, that tackles these two problems. The main idea behind this algorithm is to change the way in which the stencil computation progresses within the computational domain. Instead of accessing all required neighbors and adding all their contributions at once, the Semi-stencil algorithm divides the computation into several updates. Then, each update gathers half of the axis neighbors, partially computing at the same time the stencil in a set of closely located points. As Semi-stencil progresses through the domain, the stencil computations are completed on precomputed points. This computation strategy improves the memory access pattern and efficiently reuses the accessed data. Our initial target architecture was the Cell/B.E., where the Semi-stencil in a SPE was 44% faster than the naive stencil implementation. Since then, we have continued our research on emerging multicore architectures in order to assess and extend this work on homogeneous architectures. The experiments presented combine the Semi-stencil strategy with space- and time-blocking algorithms used in hierarchical memory architectures.
Two x86 (Intel Nehalem and AMD Opteron) and two POWER (IBM POWER6 and IBM BG/P) platforms are used as testbeds, where the best improvements for a 25-point stencil range from 1.27× to 1.76× faster. The results show that this novel strategy is a feasible optimization method which may be integrated into auto-tuning frameworks. Also, since all current architectures are multicore based, we have introduced a brief section where scalability results on IBM POWER7-, Intel Xeon-, and MIC-based systems are presented. In a nutshell, the algorithm scales as well as or better than other stencil techniques. For instance, the scalability of Semi-stencil on MIC for a certain test case reached 93.8× over 244 threads.
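
The split into partial updates described in the abstract can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation: the function names `naive_stencil` and `semi_stencil`, the coefficient layout `c[0..r]`, and the terms "forward update" and "backward update" for the two halves are assumptions made for illustration, and the real algorithm targets 3D domains combined with blocking. The sketch only shows that each step reads a window of r + 1 inputs instead of 2r + 1, while producing the same result as the classic stencil.

```python
def naive_stencil(x, c):
    """Classic symmetric 1D stencil of radius r = len(c) - 1 (interior points only)."""
    r = len(c) - 1
    out = [0.0] * len(x)
    for i in range(r, len(x) - r):
        acc = c[0] * x[i]
        for k in range(1, r + 1):
            acc += c[k] * (x[i - k] + x[i + k])  # reads 2r + 1 inputs per point
        out[i] = acc
    return out


def semi_stencil(x, c):
    """Same result, computed as two half-updates per step (illustrative sketch)."""
    r = len(c) - 1
    n = len(x)
    out = [0.0] * n

    # Prologue: the first r interior points have no earlier step that could
    # have started them, so their left-half contribution is computed here.
    for j in range(r, min(2 * r, n - r)):
        out[j] = sum(c[k] * x[j - k] for k in range(1, r + 1))

    for i in range(r, n - r):
        window = x[i:i + r + 1]  # x[i] .. x[i + r]: the only inputs read this step

        # Backward update (assumed name): finish out[i] with the centre point
        # and its right neighbours; the left half was accumulated earlier.
        out[i] += c[0] * window[0] + sum(c[k] * window[k] for k in range(1, r + 1))

        # Forward update (assumed name): start out[i + r] with its left
        # neighbours, which are exactly the first r entries of the same window.
        j = i + r
        if j < n - r:
            out[j] = sum(c[k] * window[r - k] for k in range(1, r + 1))
    return out
```

Calling both functions on the same input, for example `semi_stencil(x, c)` and `naive_stencil(x, c)` with a five-entry coefficient list (radius 4), should give results that agree up to floating-point rounding, since the two versions add the same terms in a different order.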

Funder

Ministerio de Ciencia e Innovación

Seventh Framework Programme

Partnership for Advanced Computing in Europe AISBL

Publisher

Association for Computing Machinery (ACM)

Subject

Applied Mathematics, Software

Cited by 24 articles.

1. Stencil Computation with Vector Outer Product;Proceedings of the 38th ACM International Conference on Supercomputing;2024-05-30

2. Scalable Distributed High-Order Stencil Computations;SC22: International Conference for High Performance Computing, Networking, Storage and Analysis;2022-11

3. Toward accelerated stencil computation by adapting tensor core unit on GPU;Proceedings of the 36th ACM International Conference on Supercomputing;2022-06-28

4. On the Transformation Optimization for Stencil Computation;Electronics;2021-12-23

5. DRStencil: Exploiting Data Reuse within Low-order Stencil on GPU;2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys);2021-12
