1. 2020. CUTLASS: Fast Linear Algebra in CUDA C++. https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/.
2. 2021. perf: Linux profiling with performance counters. https://perf.wiki.kernel.org/index.php/Main_Page.
3. Arm Limited. 2021. Arm Performance Libraries Reference Guide. Arm Limited. https://developer.arm.com/documentation/101004/latest/.
4. Timo Bingmann. 2013. Parallel Memory Bandwidth Benchmark/Measurement. https://panthema.net/2013/pmbw/.