Authors:
Ofir Moshe, Gil Fidel, Ron Bitton, Asaf Shabtai
Abstract
State-of-the-art deep neural networks (DNNs) are highly effective at tackling many real-world tasks. However, their widespread adoption in mission-critical contexts is limited by two major weaknesses: their susceptibility to adversarial attacks and their opaqueness. The former raises concerns about DNNs' security and generalization under real-world conditions, while the latter directly impairs interpretability. This lack of interpretability diminishes user trust, as it is difficult to have confidence in a model's decision when its reasoning is not aligned with human perspectives. In this research, we (1) examine the effect of adversarial robustness on interpretability, and (2) present a novel approach for improving DNNs' interpretability that is based on the regularization of neural activation sensitivity. We compare the interpretability of models trained using our method with that of standard models and of models trained using state-of-the-art adversarial robustness techniques. Our results show that adversarially robust models are more interpretable than standard models, and that models trained using our proposed method surpass even adversarially robust models in terms of interpretability. (Code is provided in the supplementary material.)
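To make the idea of regularizing neural activation sensitivity concrete, the following is a minimal illustrative sketch, not the authors' actual implementation: it assumes the regularizer penalizes how strongly hidden activations respond to small input perturbations, estimated here by finite differences on a toy one-layer network. The network weights, the penalty form, and the weighting coefficient `lam` are all assumptions for illustration.

```python
import numpy as np

# Toy one-layer network (hypothetical; not the paper's architecture).
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(4, 8))  # input dim 4 -> hidden dim 8
b1 = np.zeros(8)

def hidden(x):
    """ReLU hidden activations of the toy network."""
    return np.maximum(0.0, x @ W1 + b1)

def sensitivity_penalty(x, eps=1e-3):
    """Mean squared finite-difference sensitivity of the hidden
    activations with respect to each input dimension."""
    base = hidden(x)
    total = 0.0
    for i in range(x.shape[0]):
        xp = x.copy()
        xp[i] += eps  # perturb one input coordinate
        total += np.mean(((hidden(xp) - base) / eps) ** 2)
    return total / x.shape[0]

x = rng.normal(size=4)
task_loss = 0.0   # stands in for the usual task loss (e.g. cross-entropy)
lam = 0.1         # regularization strength (assumed hyperparameter)
total_loss = task_loss + lam * sensitivity_penalty(x)
```

In practice such a penalty would be computed with automatic differentiation over mini-batches and added to the training objective, so that gradient descent jointly minimizes the task loss and the activations' sensitivity to input perturbations.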
Publisher
Springer Science and Business Media LLC