Authors:
Ofir Moshe, Gil Fidel, Ron Bitton, Asaf Shabtai
Abstract
State-of-the-art deep neural networks (DNNs) are highly effective at tackling many real-world tasks. However, their widespread adoption in mission-critical contexts is limited by two major weaknesses: their susceptibility to adversarial attacks and their opaqueness. The former raises concerns about DNNs' security and generalization under real-world conditions, while the latter directly impairs interpretability. This lack of interpretability diminishes user trust, as it is difficult to have confidence in a model's decision when its reasoning is not aligned with human perspectives. In this research, we (1) examine the effect of adversarial robustness on interpretability, and (2) present a novel approach for improving DNNs' interpretability that is based on the regularization of neural activation sensitivity. We compare the interpretability of models trained using our method with that of standard models and of models trained using state-of-the-art adversarial robustness techniques. Our results show that adversarially robust models are more interpretable than standard models, and that models trained using our proposed method surpass even adversarially robust models in terms of interpretability. (Code is provided in the supplementary material.)
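To make the idea of regularizing neural activation sensitivity concrete, the following is a minimal illustrative sketch, not the authors' actual implementation: it assumes the regularizer penalizes how strongly hidden activations respond to small input perturbations, estimated here by finite differences on a toy one-layer network. The network weights, the penalty form, and the weighting coefficient `lam` are all assumptions for illustration.

```python
import numpy as np

# Toy one-layer network (hypothetical; not the paper's architecture).
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(4, 8))  # input dim 4 -> hidden dim 8
b1 = np.zeros(8)

def hidden(x):
    """ReLU hidden activations of the toy network."""
    return np.maximum(0.0, x @ W1 + b1)

def sensitivity_penalty(x, eps=1e-3):
    """Mean squared finite-difference sensitivity of the hidden
    activations with respect to each input dimension."""
    base = hidden(x)
    total = 0.0
    for i in range(x.shape[0]):
        xp = x.copy()
        xp[i] += eps  # perturb one input coordinate
        total += np.mean(((hidden(xp) - base) / eps) ** 2)
    return total / x.shape[0]

x = rng.normal(size=4)
task_loss = 0.0   # stands in for the usual task loss (e.g. cross-entropy)
lam = 0.1         # regularization strength (assumed hyperparameter)
total_loss = task_loss + lam * sensitivity_penalty(x)
```

In practice such a penalty would be computed with automatic differentiation over mini-batches and added to the training objective, so that gradient descent jointly minimizes the task loss and the activations' sensitivity to input perturbations.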
Publisher
Springer Science and Business Media LLC