Best-of-N Jailbreaking: A Novel Attack on AI Systems
2024-12-15
Researchers have developed a new black-box attack algorithm called Best-of-N (BoN) Jailbreaking. The algorithm repeatedly samples augmented versions of a prompt (for example, randomly shuffling characters or capitalizing letters) until one variant elicits a harmful response from the target AI system. BoN achieved high attack success rates (ASRs) on closed-source language models such as GPT-4o (89%) and Claude 3.5 Sonnet (78%), circumventing existing safeguards. BoN also extends to vision and audio language models, showing that even advanced AI systems are vulnerable to seemingly innocuous input variations. The research underscores how difficult it remains to make frontier models robust to simple input-level perturbations.
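
To make the idea concrete, the sketch below illustrates the kind of random text augmentations BoN relies on: light character scrambling plus random capitalization. The function name and probability values are illustrative assumptions for this article, not the paper's exact implementation or tuned parameters, and the example is applied to a benign prompt.

```python
import random

def augment_prompt(prompt: str, shuffle_prob: float = 0.1, caps_prob: float = 0.3) -> str:
    """Apply BoN-style random augmentations to a prompt.

    Swaps a few neighbouring characters and randomly upper-cases letters.
    The probabilities here are illustrative, not the paper's values.
    """
    chars = list(prompt)
    # Randomly swap adjacent characters (character scrambling).
    for i in range(len(chars) - 1):
        if random.random() < shuffle_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    # Randomly capitalize individual characters.
    chars = [c.upper() if random.random() < caps_prob else c for c in chars]
    return "".join(chars)

# Example: generate several augmented variants of a benign prompt.
for _ in range(5):
    print(augment_prompt("tell me a story about a dragon"))
```

Conceptually, BoN generates many such variants of a prompt and keeps resampling until one succeeds, which is why the attack's success rate grows with the number of samples drawn.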