Evaluating the Hijacking Risk of AI Agents: Adversarial Testing Reveals Vulnerabilities

2025-03-16
The US AI Safety Institute (US AISI) evaluated the risk of AI agent hijacking, in which an attacker embeds malicious instructions in data the agent processes in order to redirect it toward the attacker's goals, using the AgentDojo framework to test Anthropic's Claude 3.5 Sonnet model. Three findings stand out: evaluation frameworks require continuous improvement, evaluations must adapt as attack methods evolve, and aggregate metrics can mask large differences between tasks, so attack success rates should be analyzed per task. The study introduced new attack scenarios, including remote code execution, database exfiltration, and automated phishing, and demonstrated that they succeed across different agent environments, underscoring that security evaluation frameworks must keep pace with the evolving threat of agent hijacking.
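
To make the task-specific metric concrete, the following is a minimal sketch of how a hijacking evaluation might tally attack success rates per task. The `run_agent` callable, the `task.name` and `injection.goal_achieved` attributes are hypothetical placeholders for illustration, not AgentDojo's actual API; the sketch only shows the scoring pattern of running each (user task, injection) pair and reporting per-task rates.

```python
from collections import defaultdict

def evaluate_hijacking(agent, user_tasks, injections, run_agent):
    """Tally attack success rates per user task.

    `run_agent` is a hypothetical callable that executes the agent on a
    user task with an injected payload and returns the resulting
    environment state; each injection is assumed to carry a
    `goal_achieved(state)` predicate that checks whether the attacker's
    objective was met.
    """
    successes = defaultdict(int)
    attempts = defaultdict(int)
    for task in user_tasks:
        for injection in injections:
            state = run_agent(agent, task, injection)
            attempts[task.name] += 1
            if injection.goal_achieved(state):
                successes[task.name] += 1
    # Report each task separately: an aggregate rate can hide tasks
    # where the agent is far more vulnerable than average.
    return {name: successes[name] / attempts[name] for name in attempts}
```

Breaking results down by task in this way is what lets an evaluator spot which environments or task types are disproportionately vulnerable, rather than relying on a single overall success rate.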