1. Agarwal RC, Gustavson FG, Zubair M (1992) A high performance algorithm using pre-processing for the sparse matrix-vector multiplication. In: Supercomputing’92, Minnesota, November 1992. IEEE, New York, pp 32–41
2. Asanovic K, Bodik R, Catanzaro BC, Gebis JJ, Husbands P, Keutzer K, Patterson DA, Plishker WL, Shalf J, Williams SW, Yelick KA (2006) The landscape of parallel computing research: a view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley
3. Athanasaki E, Anastopoulos N, Kourtis K, Koziris N (2008) Exploring the performance limits of simultaneous multithreading for memory intensive applications. J Supercomput 44(1):64–97
4. Barrett R, Berry M, Chan TF, Demmel J, Donato JM, Dongarra J, Eijkhout V, Pozo R, Romine C, van der Vorst H (1994) Templates for the solution of linear systems: building blocks for iterative methods. SIAM, Philadelphia
5. Buttari A, Eijkhout V, Langou J, Filippone S (2005) Performance optimization and modeling of blocked sparse kernels. Technical Report ICL-UT-04-05, Innovative Computing Laboratory, University of Tennessee