A neurosymbolic approach to AI alignment

Published: 2024-08-28 Page: 1-12
ISSN:2949-8732
Container-title:Neurosymbolic Artificial Intelligence
Short-container-title:NAI

Author:

Wagner Benedikt J.¹, d’Avila Garcez Artur¹

Affiliation:

1. Department of Computer Science, City, University of London, London, United Kingdom

Abstract

We propose neurosymbolic integration as an approach for AI alignment via concept-based model explanation. The aim is to give AI systems the ability to learn from human revision while also assisting humans in evaluating AI capabilities. The proposed method allows users and domain experts to learn about the data-driven decision-making process of large neural network models and to impose a particular behaviour onto such models. The models are queried using a symbolic logic language that acts as a lingua franca between humans and model representations. Interaction with the user then confirms or rejects a revision of the model using logical constraints that can be distilled back into the neural network. We illustrate the approach using the Logic Tensor Network framework alongside Concept Activation Vectors and apply it to Convolutional Neural Networks and the task of achieving quantitative fairness. Our results illustrate how a logical language can provide users with a formalisation of the model’s decision making whilst allowing them to steer the model towards a given alignment constraint.
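
The idea of a logical constraint that can be distilled back into a network can be illustrated with a small sketch. This is not the authors' code and does not use the Logic Tensor Network library. It assumes one common choice of fuzzy operators (Reichenbach implication and a p-mean-error aggregator for the universal quantifier), and the rule, the function names, and the toy scores are all hypothetical. In the setting the abstract describes, the concept scores would presumably come from Concept Activation Vectors; here they are hand-written numbers.

```python
# Minimal sketch (assumptions: operator choices, names, and data are illustrative only).

def implies(a, b):
    """Reichenbach fuzzy implication: truth of (a -> b) for a, b in [0, 1]."""
    return 1.0 - a + a * b

def forall(truths, p=2):
    """p-mean-error aggregator: a smooth stand-in for the universal quantifier."""
    n = len(truths)
    return 1.0 - (sum((1.0 - t) ** p for t in truths) / n) ** (1.0 / p)

def rule_satisfaction(concept_scores, predictions, p=2):
    """Truth degree of the hypothetical rule: forall x. HasConcept(x) -> Positive(x)."""
    return forall([implies(c, y) for c, y in zip(concept_scores, predictions)], p)

# Toy values: degree to which each input exhibits a concept, and the model's output.
concept_scores = [0.9, 0.8, 0.1, 0.2]
predictions = [0.95, 0.30, 0.50, 0.60]

sat = rule_satisfaction(concept_scores, predictions)
loss = 1.0 - sat  # a differentiable penalty if computed on tensors instead of floats
print(f"satisfaction={sat:.3f}, loss={loss:.3f}")
```

Querying a model then amounts to reading off `sat` for a candidate rule, and imposing the rule amounts to adding `loss` to the training objective. Both steps are simplifications of the interaction loop described above.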

Publisher

IOS Press

References: 25 articles.

1. K. Ahmed, K.-W. Chang and G. Van den Broeck, A pseudo-semantic loss for deep generative models with logical constraints, in: NeurIPS, 2023.

2. P. Barbiero, G. Ciravegna, F. Giannini, M.E. Zarlenga, L.C. Magister, A. Tonda, P. Liò, F. Precioso, M. Jamnik and G. Marra, Interpretable Neural-Symbolic Concept Reasoning, 2023.

3. Network Dissection: Quantifying Interpretability of Deep Visual Representations

4. S. Casper, X. Davies, C. Shi, T.K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, T. Wang, S. Marks, C.-R. Segerie, M. Carroll, A. Peng, P. Christoffersen, M. Damani, S. Slocum, U. Anwar, A. Siththaranjan, M. Nadeau, E.J. Michaud, J. Pfau, D. Krasheninnikov, X. Chen, L. Langosco, P. Hase, E. Bıyık, A. Dragan, D. Krueger, D. Sadigh and D. Hadfield-Menell, Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback, 2023.