1. Bartz-Beielstein, T., et al.: Benchmarking in optimization: best practice and open issues. arXiv preprint arXiv:2007.03488 (2020)
2. Bartz-Beielstein, T., Mersmann, O., Chandrasekaran, S.: Ranking and result aggregation. In: Bartz, E., Bartz-Beielstein, T., Zaefferer, M., Mersmann, O. (eds.) Hyperparameter Tuning for Machine and Deep Learning with R: A Practical Guide, chap. 5, pp. 121–161. Springer Nature (2023). https://doi.org/10.1007/978-981-19-5170-1_5
3. Ben-Shachar, M.S., Lüdecke, D., Makowski, D.: Effectsize: estimation of effect size indices and standardized parameters. J. Open Source Softw. 5(56), 2815 (2020)
4. Benavoli, A., Corani, G., Demšar, J., Zaffalon, M.: Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. J. Mach. Learn. Res. 18(1), 2653–2688 (2017)
5. Berger, J.O., Sellke, T.: Testing a point null hypothesis: the irreconcilability of p values and evidence. J. Am. Stat. Assoc. 82(397), 112–122 (1987)