Multi-core and many-core shared-memory parallel raycasting volume rendering optimization and tuning

Published: 2012-04-03 Issue: 4 Volume: 26 Pages: 399-412
ISSN: 1094-3420
Container-title: The International Journal of High Performance Computing Applications
Language: en

Author:

E. Wes Bethel¹, Mark Howison²

Affiliation:

1. Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

2. Center for Computation and Visualization, Brown University, Providence, RI, USA

Abstract

Given the computing industry trend of increasing processing capacity by adding more cores to a chip, the focus of this work is tuning the performance of a staple visualization algorithm, raycasting volume rendering, for shared-memory parallelism on multi-core CPUs and many-core GPUs. Our approach is to vary tunable algorithmic settings, along with known algorithmic optimizations and two different memory layouts, and measure performance in terms of absolute runtime and L2 memory cache misses. Our results indicate there is a wide variation in runtime performance on all platforms, as much as 254% for the tunable parameters we test on multi-core CPUs and 265% on many-core GPUs, and the optimal configurations vary across platforms, often in a non-obvious way. For example, our results indicate the optimal configurations on the GPU occur at a crossover point between those that maintain good cache utilization and those that saturate computational throughput. This result is likely to be extremely difficult to predict with an empirical performance model for this particular algorithm because it has an unstructured memory access pattern that varies locally for individual rays and globally for the selected viewpoint. Our results also show that optimal parameters on modern architectures are markedly different from those in previous studies run on older architectures. In addition, given the dramatic performance variation across platforms for both optimal algorithm settings and performance results, there is a clear benefit for production visualization and analysis codes to adopt a strategy for performance optimization through auto-tuning. These benefits will likely become more pronounced in the future as the number of cores per chip and the cost of moving data through the memory hierarchy both increase.
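The auto-tuning strategy recommended in the abstract can be illustrated with a minimal sketch: time a kernel over a grid of tunable settings and keep the fastest configuration. Everything named here is hypothetical and not taken from the paper: `render_tiled` is a toy stand-in for a raycaster (one axis-aligned ray per pixel, processed tile by tile), `tile_w` and `tile_h` are assumed tunable parameters, and `autotune` is an illustrative exhaustive search. Pure-Python timings do not reflect cache behavior on real hardware, so the sketch shows only the shape of the search, not realistic results.

```python
import itertools
import time


def render_tiled(volume, width, height, depth, tile_w, tile_h):
    """Toy stand-in for a raycaster (hypothetical, not the paper's code).

    Casts one axis-aligned ray per pixel, visiting pixels tile by tile,
    and accumulates the samples along the depth axis.
    """
    image = [0.0] * (width * height)
    for ty in range(0, height, tile_h):
        for tx in range(0, width, tile_w):
            for y in range(ty, min(ty + tile_h, height)):
                for x in range(tx, min(tx + tile_w, width)):
                    acc = 0.0
                    for z in range(depth):
                        acc += volume[(z * height + y) * width + x]
                    image[y * width + x] = acc
    return image


def autotune(kernel, param_grid, repeats=3):
    """Exhaustively time kernel(**params) over every combination in param_grid.

    Returns (best_params, timings), where timings is a list of
    (seconds, params) pairs and each time is the minimum over `repeats` runs.
    """
    names = sorted(param_grid)
    timings = []
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(**params)
            best = min(best, time.perf_counter() - start)
        timings.append((best, params))
    _, best_params = min(timings, key=lambda entry: entry[0])
    return best_params, timings


if __name__ == "__main__":
    W = H = D = 16  # small assumed volume so the sweep finishes quickly
    vol = [1.0] * (W * H * D)
    grid = {"tile_w": [1, 4, 16], "tile_h": [1, 4, 16]}
    best, results = autotune(
        lambda tile_w, tile_h: render_tiled(vol, W, H, D, tile_w, tile_h),
        grid,
    )
    print("fastest configuration on this machine:", best)
```

Because the fastest configuration depends on the machine, the sweep would be rerun per platform rather than hard-coding its result, which is the point the abstract makes about optimal settings varying across architectures.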

Publisher

SAGE Publications

Subject

Hardware and Architecture, Theoretical Computer Science, Software

Link

http://journals.sagepub.com/doi/pdf/10.1177/1094342012440466

References: 39 articles.

1. Parallel Ray Casting of Visible Human on Distributed Memory Architectures

2. Streamline Integration Using MPI-Hybrid Parallelism on a Large Multicore Architecture

3. Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors

Cited by 15 articles.

1. Virtual prototyping of complex optical systems on multiprocessor workstations; Optical Engineering; 2022-12-07

2. The virtual prototyping of complex optical systems on multiprocessor workstations; Computational Optics 2021; 2021-09-14

3. Scalar Field Comparison with Topological Descriptors: Properties and Applications for Scientific Visualization; Computer Graphics Forum; 2021-06

4. On Evaluating Runtime Performance of Interactive Visualizations; IEEE Transactions on Visualization and Computer Graphics; 2020-09-01

5. Pattern Learning Based Parallel Ant Colony Optimization; 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC); 2017-12