Affiliation:
1. Interchange Forum for Reflecting on Intelligent Systems, University of Stuttgart, Stuttgart 70569, Germany
Abstract
Large language models (LLMs) are currently at the forefront of intertwining AI systems with human communication and everyday life. Thus, aligning them with human values is of great importance. However, given their steadily increasing reasoning abilities, future LLMs are suspected of becoming able to deceive human operators and of using this ability to bypass monitoring efforts. As a prerequisite, LLMs need to possess a conceptual understanding of deception strategies. This study reveals that such strategies emerged in state-of-the-art LLMs, but were nonexistent in earlier LLMs. We conduct a series of experiments showing that state-of-the-art LLMs are able to understand and induce false beliefs in other agents, that their performance in complex deception scenarios can be amplified by chain-of-thought reasoning, and that eliciting Machiavellianism in LLMs can trigger misaligned deceptive behavior. GPT-4, for instance, exhibits deceptive behavior in simple test scenarios 99.16% of the time (P < 0.001). In complex second-order deception test scenarios, where the aim is to mislead someone who expects to be deceived, GPT-4 resorts to deceptive behavior 71.46% of the time (P < 0.001) when augmented with chain-of-thought reasoning. In sum, revealing hitherto unknown machine behavior in LLMs, our study contributes to the nascent field of machine psychology.
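As a rough illustration of how a reported deception rate such as 99.16% could be compared against a chance baseline, the following is a minimal Python sketch using a two-sided binomial test. The trial count, the 50% chance baseline, and the choice of test are illustrative assumptions for this sketch and are not taken from the study itself.

```python
# Minimal sketch (assumptions, not the paper's actual analysis): test whether an
# observed deception rate differs significantly from a 50% chance baseline.
from scipy.stats import binomtest

n_trials = 1000      # hypothetical number of test scenarios
n_deceptive = 992    # hypothetical count, corresponding to ~99.2% deceptive choices
chance_level = 0.5   # assumed null hypothesis: deceptive option chosen at chance

result = binomtest(n_deceptive, n_trials, p=chance_level, alternative="two-sided")
print(f"deception rate: {n_deceptive / n_trials:.2%}, p-value: {result.pvalue:.3g}")
```

With counts of this magnitude, the resulting p-value falls far below 0.001, which is consistent with the kind of significance level reported in the abstract.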
Funder
Ministry of Science, Research, and Arts Baden-Württemberg
Publisher
Proceedings of the National Academy of Sciences
Cited by
2 articles.