How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation

Published:2021-05-18 Issue:15 Volume:35 Page:13561-13569
ISSN:2374-3468
Container-title:Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title:AAAI

Author:

Mishra Swaroop,Arunkumar Anjana

Abstract

Models that top leaderboards often perform unsatisfactorily when deployed in real-world applications; this has necessitated rigorous and expensive pre-deployment model testing. A hitherto unexplored facet of model performance is: Are our leaderboards doing equitable evaluation? In this paper, we introduce a task-agnostic method to probe leaderboards by weighting samples based on their 'difficulty' level. We find that leaderboards can be adversarially attacked and that top-performing models may not always be the best models. We subsequently propose alternate evaluation metrics. Our experiments on 10 models show changes in model ranking and an overall reduction in previously reported performance, thus rectifying the overestimation of AI systems' capabilities. Inspired by behavioral testing principles, we further develop a prototype of a visual analytics tool that enables leaderboard revamping through customization, based on an end user's focus area. This helps users analyze models' strengths and weaknesses, and guides them in the selection of a model best suited for their application scenario. In a user study, members of various commercial product development teams, covering 5 focus areas, find that our prototype reduces pre-deployment development and testing effort by 41% on average.
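
The abstract does not spell out how samples are weighted, so the following is only a minimal illustrative sketch of the general idea: scoring each model by difficulty-weighted accuracy instead of plain accuracy can reorder a leaderboard. The function names, the model names, the per-sample results, and the difficulty scores are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def weighted_accuracy(correct: List[bool], weights: List[float]) -> float:
    """Fraction of total sample weight that the model answered correctly."""
    if len(correct) != len(weights):
        raise ValueError("correct and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for c, w in zip(correct, weights) if c) / total


def rank_models(results: Dict[str, List[bool]], weights: List[float]) -> List[str]:
    """Return model names sorted from best to worst weighted accuracy."""
    scores = {name: weighted_accuracy(c, weights) for name, c in results.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical per-sample outcomes: four "easy" samples, then two "hard" ones.
results = {
    "model_a": [True, True, True, True, False, False],   # solves only easy samples
    "model_b": [True, False, False, True, True, False],  # solves one hard sample
}

uniform = [1.0] * 6                            # plain accuracy
difficulty = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0]    # assumed difficulty scores

print(rank_models(results, uniform))     # ranking under plain accuracy
print(rank_models(results, difficulty))  # ranking under difficulty weighting
```

With these made-up numbers, model_a leads under uniform weights (4/6 versus 3/6), while model_b leads once hard samples count three times as much (5/10 versus 4/10). Both scores are lower under the weighted metric for model_a, which mirrors the abstract's point that reported performance can drop when difficulty is taken into account.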

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 5 articles.

1. LINGO: Visually Debiasing Natural Language Instructions to Support Task Diversity;Computer Graphics Forum;2023-06

2. End-User Development for Artificial Intelligence: A Systematic Literature Review;End-User Development;2023

3. Towards the Development of Disaster Management Tailored Machine Learning Systems;2022 IEEE India Council International Subsections Conference (INDISCON);2022-07-15

4. A Spectral View of Randomized Smoothing Under Common Corruptions: Benchmarking and Improving Certified Robustness;Lecture Notes in Computer Science;2022

5. PMU Tracker: A Visualization Platform for Epicentric Event Propagation Analysis in the Power Grid;IEEE Transactions on Visualization and Computer Graphics;2022