According to a new report by Anthropic, bypassing the safeguards of large language models (LLMs) remains relatively straightforward and can be automated. A team of researchers has developed the Best-of-N (BoN) Jailbreaking algorithm, which circumvents the protective mechanisms of modern AI systems by applying simple modifications to the input.
The term “jailbreaking,” originally associated with removing software restrictions on iPhones, has found new relevance in the field of artificial intelligence. In this context, jailbreaking refers to techniques that bypass built-in restrictions designed to prevent the generation of harmful content. The new method has been tested on models such as GPT-4o, Claude 3.5, Claude 3 Opus, Gemini-1.5, and Llama 3.
The BoN Jailbreaking algorithm repeatedly generates variations of the original prompt, introducing random alterations such as word rearrangement, changes in capitalization, spelling errors, or grammatical distortions. The process continues until the model answers a prohibited query. For instance, if GPT-4o is asked directly how to build a bomb, the system will refuse, citing its usage policies. But by randomly capitalizing letters, introducing typos, or reordering words, the algorithm can eventually elicit the desired response.
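In essence, the approach is rejection sampling over perturbed prompts. The sketch below is a minimal illustration of that loop, not Anthropic's implementation; `query_model` and `is_harmful` are hypothetical stand-ins for the target LLM's API and a harmfulness judge, and the three perturbations mirror those described above.

```python
import random

def augment(prompt: str) -> str:
    """Apply one random text perturbation: shuffle word order,
    transpose letters inside words, or randomize capitalization."""
    words = prompt.split()
    r = random.random()
    if r < 1 / 3:  # word rearrangement
        random.shuffle(words)
    elif r < 2 / 3:  # spelling noise: swap two inner letters
        words = [w[0] + w[2] + w[1] + w[3:] if len(w) > 3 else w
                 for w in words]
    else:  # random capitalization
        words = ["".join(c.upper() if random.random() < 0.5 else c.lower()
                         for c in w) for w in words]
    return " ".join(words)

def bon_jailbreak(prompt: str, query_model, is_harmful, n_max: int = 10_000):
    """Resample augmented prompts until the target model returns a
    response the judge flags, or the attempt budget runs out."""
    for attempt in range(1, n_max + 1):
        candidate = augment(prompt)
        response = query_model(candidate)   # call to the target LLM (stub)
        if is_harmful(response):            # external classifier (stub)
            return attempt, candidate, response
    return None
```

Because every attempt is independent, the loop requires no gradient access or model internals and parallelizes trivially; the only signal it consumes is the judge's verdict on each response.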
The researchers tested BoN Jailbreaking on text, audio, and image inputs. For audio prompts, they modified parameters such as speed, pitch, and volume, or added noise and music; for images, they varied fonts, background colors, and the size and position of objects. The method proved highly effective, achieving attack success rates above 50% on every model tested, including GPT-4o and Claude 3.5, within 10,000 augmented attempts.
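A back-of-the-envelope calculation makes that figure plausible. Under the simplifying (and not strictly accurate) assumption that each augmented attempt independently succeeds with some small probability p, the chance that at least one of N attempts slips through is 1 − (1 − p)^N. The toy model below illustrates the point; it is not a computation from the report.

```python
def expected_asr(p: float, n: int) -> float:
    """Toy model: probability that at least one of n independent
    augmented prompts bypasses the safeguard, given per-attempt
    success probability p. Real attempts are not truly independent."""
    return 1 - (1 - p) ** n

# Even a 0.01% per-attempt chance compounds to ~63% over 10,000 tries:
print(expected_asr(0.0001, 10_000))  # ≈ 0.632
```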
The BoN Jailbreaking algorithm automates techniques that were previously applied by hand. Earlier in 2024, for example, reports revealed that users were producing inappropriate images with Microsoft's image generator by deliberately distorting names and descriptions, and similar tricks bypassed the protections of audio generators through inserted pauses or altered recordings.
Anthropic emphasizes that the purpose of this research is not merely to expose vulnerabilities in modern systems but to establish a foundation for developing new protective mechanisms. Detailed analyses of successful attacks can inform the creation of more robust defenses to prevent their recurrence.
Nevertheless, unregulated AI models without built-in safeguards are already on the market. These models respond without restriction and can generate content that violates ethical standards and legal norms, raising serious concerns about the need for stricter regulation and more advanced security technologies.