Anthropic's Constitutional Classifiers: A New Defense Against AI Jailbreaks

2025-02-03
Anthropic's Safeguards Research Team unveils Constitutional Classifiers, a novel defense against AI jailbreaks. Trained on synthetically generated data, the system filters harmful content while keeping false positives low. A prototype withstood thousands of hours of human red teaming and sharply reduced jailbreak success rates, though it initially suffered from high refusal rates and heavy computational overhead. An updated version retains that robustness with only a small increase in refusal rate and a moderate additional compute cost. A temporary live demo invites security experts to probe its resilience, paving the way for safer deployment of increasingly capable AI models.
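
To make the filtering idea concrete, below is a minimal sketch of classifier-gated generation in Python. The classifier stubs, the threshold value, and the streaming model interface are hypothetical placeholders for illustration only, not Anthropic's trained Constitutional Classifiers.

```python
# Sketch of classifier-gated generation. The keyword-based "classifiers" here
# stand in for trained models and exist only to show the control flow.

from dataclasses import dataclass
from typing import Callable, Iterable

BLOCK_THRESHOLD = 0.5  # assumed decision threshold, not from the source


@dataclass
class GateResult:
    allowed: bool
    reason: str
    text: str = ""


def stub_input_classifier(prompt: str) -> float:
    """Placeholder for a trained input classifier returning a harm score in [0, 1]."""
    return 1.0 if "nerve agent synthesis" in prompt.lower() else 0.0


def stub_output_classifier(partial_output: str) -> float:
    """Placeholder for a trained output classifier scoring the running completion."""
    return 1.0 if "step 1: acquire precursor" in partial_output.lower() else 0.0


def guarded_generate(
    prompt: str,
    generate_stream: Callable[[str], Iterable[str]],
) -> GateResult:
    # Screen the prompt before any generation happens.
    if stub_input_classifier(prompt) >= BLOCK_THRESHOLD:
        return GateResult(False, "input classifier flagged the prompt")

    # Screen the completion as it streams, halting on the first flagged chunk.
    produced: list[str] = []
    for chunk in generate_stream(prompt):
        produced.append(chunk)
        if stub_output_classifier("".join(produced)) >= BLOCK_THRESHOLD:
            return GateResult(False, "output classifier flagged the completion")

    return GateResult(True, "passed both classifiers", "".join(produced))


if __name__ == "__main__":
    def fake_model(prompt: str) -> Iterable[str]:
        # Stand-in for a real model's streamed tokens.
        yield "Here is a harmless answer "
        yield "about the requested topic."

    print(guarded_generate("Tell me about lab safety basics.", fake_model))
```

Checking the output incrementally, as sketched, lets a deployment halt a streamed completion as soon as it turns harmful rather than waiting for the full response.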