1. NVidia (2012) Kepler K20X application performance technical brief. [online]. http://www.nvidia.com/docs/IO/122874/K20-and-K20X-application-performance-technical-brief.pdf
2. NVidia CUDA (2013) The NVIDIA CUDA basic linear algebra subroutines. [online]. https://developer.nvidia.com/cublas
3. Innovative Computing Laboratory (ICL) (2013) Matrix algebra on GPU and multicore architectures. [online]. http://icl.cs.utk.edu/magma/
4. Church PC et al (2011) Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers. In: Engineering in medicine and biology society. EMBC. IEEE, Boston, pp 924–927
5. Nath R, Tomov S, Dong T, Dongarra J (2011) Optimizing symmetric dense matrix-vector multiplication. In: High performance computing, networking, storage and analysis (SC). IEEE, Seattle, pp 1–10