![Anthropic Unveils the Strongest Defense Against AI Jailbreaks Yet](https://singularityhub.com/uploads/2025/02/man-in-hat-behind-bars.jpeg?auto=webp)
The company offered hackers $15,000 to crack the system. No one could.
Despite considerable efforts to prevent AI chatbots from providing harmful responses, they’re vulnerable to jailbreak prompts that sidestep safety mechanisms. Anthropic has now unveiled the strongest protection against these kinds of attacks to date.
One of the greatest strengths of large language models is their generality. This makes it possible to apply them to a wide range of natural language roles, from translator to research assistant to writing coach.
But this also makes it hard to predict how people will exploit them. Experts worry they could be used for a variety of harmful tasks, such as generating misinformation, automating hacking workflows, or even helping people build bombs, dangerous chemicals, or bioweapons.
AI companies go to great lengths to prevent their models from producing this kind of material—training the algorithms with human feedback to avoid harmful outputs, implementing filters for malicious prompts, and enlisting hackers to circumvent defenses so the holes can be patched.
Yet most models are still vulnerable to so-called jailbreaks—inputs designed to sidestep these protections. Jailbreaks can be accomplished with unusual formatting, such as random capitalization, swapping letters for numbers, or asking the model to adopt certain personas that ignore restrictions.
Now, though, Anthropic says it has developed a new approach that provides the strongest protection against these attacks so far. To prove its effectiveness, the company offered hackers a $15,000 prize to crack the system. No one claimed the prize, despite participants collectively spending some 3,000 hours trying.
The technique involves training filters that both block malicious prompts and detect when the model is outputting harmful material. To do this, the company created what it calls a constitution. This is a list of principles governing the kinds of responses the model is allowed to produce.
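To make the idea concrete, a constitution in this sense is just a set of plain-language principles. The minimal sketch below shows what one might look like; the specific principles and the Python structure are illustrative assumptions, not Anthropic's actual constitution.

```python
# A "constitution" here is a list of plain-language principles describing what
# the model may and may not help with. The wording and structure below are
# illustrative assumptions, not Anthropic's actual constitution.
CONSTITUTION = {
    "allowed": [
        "General chemistry education, such as how common household compounds react",
        "History and policy discussion of chemical weapons treaties",
    ],
    "disallowed": [
        "Step-by-step synthesis routes for chemical warfare agents",
        "Advice on acquiring precursors for restricted chemical agents",
    ],
}
```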
In research outlined in a non-peer-reviewed paper posted to arXiv, the company created a constitution to prevent the model from generating content that could aid in the building of chemical weapons. The constitution was then fed into the company’s Claude chatbot to produce a large number of prompts and responses covering both acceptable and unacceptable topics.
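A rough sketch of that data-generation step might look like the following, where `generate` stands in for a call to the chatbot; the function, prompt wording, and JSON format are assumptions made for illustration rather than details taken from the paper.

```python
import json
from typing import Callable

def build_training_examples(
    constitution: dict,
    generate: Callable[[str], str],  # assumed stand-in for a call to an LLM such as Claude
    n_per_principle: int = 100,
) -> list[dict]:
    """Use the model to produce labeled prompt/response pairs covering both
    acceptable and unacceptable topics, for later training of the filters."""
    examples = []
    for label in ("allowed", "disallowed"):
        for principle in constitution[label]:
            raw = generate(
                f"Given the principle: '{principle}', write {n_per_principle} "
                "example user prompts and model responses as a JSON list of "
                'objects with "prompt" and "response" fields.'
            )
            for item in json.loads(raw):
                examples.append({"principle": principle, "label": label, **item})
    return examples
```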
The responses were then used to fine-tune two instances of the company’s smallest AI model, Claude Haiku: one to filter out inappropriate prompts and another to filter out harmful responses. The output filter operates in real time as a response is generated, allowing it to cut off the output partway through if the response starts heading in a harmful direction.
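Conceptually, the output filter behaves like a streaming classifier that re-scores the response as it is generated and halts it once a harm threshold is crossed. The sketch below illustrates that idea; the `harm_score` classifier, the threshold, and the token stream are placeholders I've assumed, not Anthropic's implementation.

```python
from typing import Callable, Iterable

def guarded_stream(
    tokens: Iterable[str],
    harm_score: Callable[[str], float],  # assumed classifier: partial text -> harm probability
    threshold: float = 0.5,
) -> str:
    """Stream tokens to the user, re-scoring the partial response after each
    token and cutting generation off if it drifts toward harmful content."""
    partial = ""
    for token in tokens:
        partial += token
        if harm_score(partial) > threshold:
            # Halt mid-response instead of waiting for the full output.
            return partial + " [response halted by output filter]"
    return partial
```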
The researchers used these filters to protect the company’s larger Claude Sonnet model as it responded to prompts from 183 participants in a red-teaming hacking competition. Participants tried to find a universal jailbreak, a technique that bypasses all the model’s defenses. To succeed, they had to get the model to answer every one of 10 forbidden queries, something none of them achieved.
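In other words, an attack only counted as a universal jailbreak if it elicited answers to all 10 forbidden queries at once, a criterion that is easy to state in code (the `answers_query` judge below is an assumed placeholder):

```python
from typing import Callable

def is_universal_jailbreak(
    attack: str,
    forbidden_queries: list[str],
    answers_query: Callable[[str, str], bool],  # assumed judge: did the model answer this query?
) -> bool:
    """A jailbreak is 'universal' only if it gets the model to answer every
    forbidden query, not just some of them."""
    return all(answers_query(attack, query) for query in forbidden_queries)
```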
To further evaluate the approach, the researchers used another large language model to generate 10,000 synthetic jailbreaking prompts, including ones deliberately designed to work around the new safety features. They then subjected two versions of Claude Sonnet to these jailbreaking prompts, one protected by the new filter and one that wasn’t. The vanilla version of Claude responded to 86 percent of the prompts, but the one protected by the new system only responded to 4.4 percent.
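Those figures amount to a simple attack-success-rate comparison over the synthetic prompt set, roughly as sketched here; the `model_complies` judge is an assumed placeholder for deciding whether a response was harmful.

```python
from typing import Callable

def attack_success_rate(
    prompts: list[str],
    model_complies: Callable[[str], bool],  # assumed judge: did the prompt elicit a harmful response?
) -> float:
    """Percentage of jailbreak prompts that elicit a harmful response."""
    successes = sum(1 for prompt in prompts if model_complies(prompt))
    return 100 * successes / len(prompts)

# Reported results: roughly 86 percent success against the unprotected model
# versus 4.4 percent with the new filters in place.
```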
One downside of these kinds of filters is that they may block legitimate prompts, but the researchers found the refusal rate rose by only 0.38 percentage points. The filters did add a 23.7 percent increase in compute costs, however, which could be significant in commercial deployments.
It’s also important to remember that although the approach significantly improved defenses against universal jailbreaks capable of cracking all 10 forbidden queries, many individual queries still slipped through. Nonetheless, the researchers say the lack of universal jailbreaks makes their filters much harder to get past. They also suggest the filters should be used in conjunction with other defenses.
“While these results are promising, common wisdom suggests that system vulnerabilities will likely emerge with continued testing,” they write. “Responsibly deploying advanced AI models with scientific capabilities will thus require complementary defenses.”
Building these kinds of defenses is always a cat-and-mouse game with attackers, so this is unlikely to be the last word in AI safety. But the discovery of a much more reliable way to constrain harmful outputs is likely to significantly increase the number of areas in which AI can be safely deployed.