1. Mark H. 2008. Optimizing parallel reduction in CUDA. NVIDIA CUDA SDK.
2. Martín, P.J., Ayuso, L.F., Torres, R. and Gavilanes, A., 2012, July. Algorithmic strategies for optimizing the parallel reduction primitive in CUDA. In High Performance Computing and Simulation (HPCS), 2012 International Conference on (pp. 511-519). IEEE.
3. Automatic Generation of Warp-Level Primitives and Atomic Instructions for Fast and Portable Parallel Reduction on GPUs
4. J. Luitjens, Faster Parallel Reductions on Kepler, Feb. 2014, [online] Available: http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler.
5. The GPU version of LASG/IAP Climate System Ocean Model version 3 (LICOM3) under the heterogeneous-compute interface for portability (HIP) framework and its large-scale application