Affiliation:
1. Dipartimento di Ingegneria e Scienza dell’Informazione, University of Trento, 38123 Trento, Italy
2. Dipartimento di Informatica, University of Pisa, 56126 Pisa, Italy
3. Centro Interdipartimentale Mente/Cervello, University of Trento, 38123 Trento, Italy
Abstract
Research on Explainable Artificial Intelligence has recently started exploring the idea of producing explanations that, rather than being expressed in terms of low-level features, are encoded in terms of interpretable concepts learned from data. How to reliably acquire such concepts is, however, still fundamentally unclear. An agreed-upon notion of concept interpretability is missing, with the result that concepts used by both post hoc explainers and concept-based neural networks are acquired through a variety of mutually incompatible strategies. Critically, most of these neglect the human side of the problem: a representation is understandable only insofar as it can be understood by the human at the receiving end. The key challenge in human-interpretable representation learning (HRL) is how to model and operationalize this human element. In this work, we propose a mathematical framework for acquiring interpretable representations suitable for both post hoc explainers and concept-based neural networks. Our formalization of HRL builds on recent advances in causal representation learning and explicitly models a human stakeholder as an external observer. This allows us to derive a principled notion of alignment between the machine’s representation and the vocabulary of concepts understood by the human. In doing so, we link alignment and interpretability through a simple and intuitive name transfer game, and clarify the relationship between alignment and a well-known property of representations, namely disentanglement. We also show that alignment is linked to the issue of undesirable correlations among concepts, also known as concept leakage, and to content-style separation, all through a general information-theoretic reformulation of these properties. Our conceptualization aims to bridge the gap between the human and algorithmic sides of interpretability and establish a stepping stone for new research on human-interpretable representations.
Funder
NextGenerationEU
EU Horizon 2020 research and innovation programme
Subject
General Physics and Astronomy