Universal Prompt Injection Bypasses Safety Guardrails on All Major LLMs

2025-04-25
Universal Prompt Injection Bypasses Safety Guardrails on All Major LLMs

Researchers at HiddenLayer have developed a novel prompt injection technique, dubbed "Policy Puppetry," that successfully bypasses instruction hierarchies and safety guardrails across all major frontier AI models, including those from OpenAI, Google, Microsoft, Anthropic, Meta, DeepSeek, Qwen, and Mistral. This technique, combining an internally developed policy technique and roleplaying, generates outputs violating AI safety policies related to CBRN threats, mass violence, self-harm, and system prompt leakage. Its transferability across model architectures and inference strategies highlights inherent flaws in relying solely on RLHF for model alignment and underscores the need for proactive security testing, especially for organizations deploying LLMs in sensitive environments.