Affiliation:
1. Department of Bioscience and Bioinformatics, Kyushu Institute of Technology Iizuka, Fukuoka 820-8502, Japan
Abstract
Large Language Models (LLMs), such as ChatGPT, encounter ‘jailbreak’ challenges, wherein safeguards are circumvented to generate ethically harmful prompts. This study introduces a straightforward black-box method for efficiently crafting jailbreak prompts that bypass LLM defenses. Our technique iteratively transforms harmful prompts into benign expressions directly utilizing the target LLM, predicated on the hypothesis that LLMs can autonomously generate expressions that evade safeguards. Through experiments conducted with ChatGPT (GPT-3.5 and GPT-4) and Gemini-Pro, our method consistently achieved an attack success rate exceeding 80% within an average of five iterations for forbidden questions and proved to be robust against model updates. The jailbreak prompts generated were not only naturally worded and succinct, but also challenging to defend against. These findings suggest that the creation of effective jailbreak prompts is less complex than previously believed, underscoring the heightened risk posed by black-box jailbreak attacks.
Reference31 articles.
1. (2024, March 18). OpenAI. Introducing ChatGPT. Available online: https://openai.com/blog/chatgpt.
2. Fraiwan, M., and Khasawneh, N. (2023). A Review of ChatGPT Applications in Education, Marketing, Software Engineering, and Healthcare: Benefits, Drawbacks, and Research Directions. arXiv.
3. Sallam, M. (2023). ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare, 11.
4. Large language models in medicine;Thirunavukarasu;Nat. Med.,2023
5. Revisiting the political biases of ChatGPT;Sasuke;Front. Artif. Intell.,2023