Current cases of AI misalignment and their implications for future risks
Abstract
How can one build AI systems such that they pursue the goals their designers want them to pursue? This is the alignment problem. Numerous authors have raised concerns that, as research advances and systems become more powerful over time, misalignment might lead to catastrophic outcomes, perhaps even to the extinction or permanent disempowerment of humanity. In this paper, I analyze the severity of this risk based on current instances of misalignment. More specifically, I argue that contemporary large language models and game-playing agents are sometimes misaligned. These cases suggest that misalignment tends to have a variety of features: it can be hard to detect, predict, and remedy; it does not depend on a specific architecture or training paradigm; it tends to diminish a system's usefulness; and it is the default outcome of creating AI via machine learning. Subsequently, based on these features, I show that the risk from AI misalignment magnifies with respect to more capable systems. Not only might more capable systems cause more harm when misaligned, but aligning them should also be expected to be more difficult than aligning current AI.
Funder
Friedrich-Alexander-Universität Erlangen-Nürnberg
Publisher
Springer Science and Business Media LLC
Subject
General Social Sciences, Philosophy
Cited by
7 articles.