1. Hall JD, Carr NA, Hart JC. Cache and bandwidth aware matrix multiplication on the GPU. Technical Report UIUCDCS-R-2003-2328; 2003.
2. Galoppo N, Govindaraju NK, Henson M, Manocha D. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. Proceedings of the 2005 ACM/IEEE Conference on Supercomputing; 2005. p. 3.
3. NVIDIA Corporation. CUDA C Programming Guide. 2014. http://docs.nvidia.com/cuda/cuda-c-programming-guide/