Reward tampering and evolutionary computation: a study of concrete AI-safety problems using evolutionary algorithms

Author:

Nilsen Mathias K., Nygaard Tønnes F., Ellefsen Kai Olav

Abstract

Reward tampering is a problem that will impact the trustworthiness of the powerful AI systems of the future. It describes the situation in which AI agents bypass their intended objective, enabling unintended and potentially harmful behaviours. This paper investigates whether the creative potential of evolutionary algorithms could help ensure trustworthy solutions when facing this problem. Evolutionary algorithms may help combat reward tampering because they can find a diverse collection of solutions to a problem within a single run, aiding the search for desirable solutions. Four different evolutionary algorithms were deployed in tasks illustrating the problem of reward tampering. The algorithms were designed with varying degrees of human expertise, to measure how human guidance influences the ability to discover trustworthy solutions. The results indicate that the algorithms’ ability to find and preserve trustworthy solutions depends strongly on preserving diversity during the search. Algorithms searching for behavioural diversity proved to be the most effective against reward tampering. Human expertise was also shown to improve the certainty and quality of safe solutions, but even with only a minimal degree of human expertise, domain-independent diversity management was found to discover safe solutions.

Funder

University of Oslo

Publisher

Springer Science and Business Media LLC

Subject

Computer Science Applications, Hardware and Architecture, Theoretical Computer Science, Software
