Open Sesame! Universal Black-Box Jailbreaking of Large Language Models

Published: 2024-08-14 Volume: 14 Issue: 16 Page: 7150
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Lapid Raz¹², Langberg Ron², Sipper Moshe¹

Affiliation:

1. Department of Computer Science, Ben-Gurion University, Beer-Sheva 8410501, Israel

2. DeepKeep, Tel-Aviv 6701203, Israel

Abstract

Large language models (LLMs), designed to provide helpful and safe responses, often rely on alignment techniques to align with user intent and social guidelines. Unfortunately, this alignment can be exploited by malicious actors seeking to manipulate an LLM’s outputs for unintended purposes. In this paper, we introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that—when combined with a user’s query—disrupts the attacked model’s alignment, resulting in unintended and potentially harmful outputs. Our novel approach systematically reveals a model’s limitations and vulnerabilities by uncovering instances where its responses deviate from expected behavior. Through extensive experiments, we demonstrate the efficacy of our technique, thus contributing to the ongoing discussion on responsible AI development by providing a diagnostic tool for evaluating and enhancing alignment of LLMs with human intent. To our knowledge, this is the first automated universal black-box jailbreak attack.

Funder

Israeli Innovation Authority through the Trust.AI consortium

Publisher

MDPI AG

Link

https://www.mdpi.com/2076-3417/14/16/7150/pdf

Cited by 1 article.

1. Watch Your Words: Successfully Jailbreak LLM by Mitigating the “Prompt Malice”; Lecture Notes in Computer Science; 2024