Abstract
Although cross-validation (CV) is a standard technique in machine learning and data science, its efficacy remains largely unexplored in ranking environments. When evaluating the significance of differences, cross-validation is typically coupled with a statistical test, such as the Dietterich, Alpaydin, or Wilcoxon test. In this paper, we evaluate the power and the false positive error rate of the Dietterich, Alpaydin, and Wilcoxon statistical tests combined with cross-validation, each operating with 5 to 10 folds, resulting in a total of 18 variants. Our testing setup uses a ranking framework similar to the Sum of Ranking Differences (SRD) statistical procedure: we assume the existence of a reference ranking, and distances are measured in the $L_1$-norm. We test the methods in artificial scenarios as well as on real data from sports and chemistry. The choice of the optimal CV test method depends on preferences regarding the minimization of type I and type II errors, the size of the input, and the anticipated patterns in the data. Among the investigated input sizes, the Wilcoxon method with eight folds proved to be the most effective, although its performance in type I situations is subpar. While the Dietterich and Alpaydin methods excel in type I situations, they perform poorly in type II scenarios. The inadequate performance of these tests raises questions about their efficacy outside of ranking environments as well.
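As a rough illustration of the evaluation setup described above, the sketch below computes fold-wise $L_1$ distances of two hypothetical methods' rankings from a reference ranking and compares them with a Wilcoxon signed-rank test. All names, the fold count, and the simulated rankings are assumptions for illustration only; this is not the paper's SRD-based procedure, and it does not cover the Dietterich or Alpaydin tests.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Hypothetical setup: a reference ranking of 10 objects and two methods'
# rankings obtained on each of 8 cross-validation folds (illustrative only).
n_objects, n_folds = 10, 8
reference = np.arange(n_objects)  # assumed "true" ranking of the objects


def l1_distance(ranking, reference):
    """L1-norm distance between a ranking and the reference ranking."""
    return np.abs(np.asarray(ranking) - np.asarray(reference)).sum()


def noisy_ranking(noise):
    """Simulate a fold-wise ranking as a noisy copy of the reference."""
    scores = reference + rng.normal(0, noise, n_objects)
    return np.argsort(np.argsort(scores))  # convert scores to ranks


# Method A tracks the reference closely; method B is noisier (simulated data,
# not taken from the paper).
dist_a = [l1_distance(noisy_ranking(1.0), reference) for _ in range(n_folds)]
dist_b = [l1_distance(noisy_ranking(3.0), reference) for _ in range(n_folds)]

# Wilcoxon signed-rank test on the paired fold-wise distances.
stat, p_value = wilcoxon(dist_a, dist_b)
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p_value:.4f}")
```

A small p-value here would indicate that the two methods' fold-wise distances from the reference ranking differ systematically; the paper's comparison of test variants asks how reliably such a verdict is reached (power) and how often it is reached spuriously (false positive rate).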
Funder
Nemzeti Kutatási, Fejlesztési és Innovációs Alap
Corvinus University of Budapest
Publisher
Springer Science and Business Media LLC