New LLM Jailbreak Exploits Models' Evaluation Skills
2025-01-12

Researchers have discovered a novel LLM jailbreak technique, dubbed "Bad Likert Judge." The method turns a model's ability to evaluate harmful content against itself: the attacker first prompts the LLM to act as a judge, scoring the harmfulness of responses on a Likert scale, and then asks it to generate example responses matching each score. The example aligned with the highest rating can contain the harmful content itself, yielding outputs related to malware, illegal activities, harassment, and more. Tested on six state-of-the-art models across 1,440 cases, the technique achieved an average success rate of 71.6%, reaching as high as 87.6%. The researchers recommend that maintainers of LLM applications apply content filters to both prompts and responses to mitigate such attacks.
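
To make the recommended mitigation concrete, here is a minimal sketch of application-layer content filtering, using OpenAI's Moderation API as one example of a filter (the endpoint and model names are real, but the `is_flagged` and `guarded_chat` helpers and their integration point are illustrative assumptions, not part of the researchers' work):

```python
# Minimal sketch: screen both the user prompt and the model's reply
# with a content filter before anything reaches the end user.
# Any equivalent harmfulness classifier could stand in for the
# Moderation API; the helpers below are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_flagged(text: str) -> bool:
    """Return True if the content filter flags the text as harmful."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return result.results[0].flagged

def guarded_chat(user_prompt: str) -> str:
    # Filter the incoming prompt (e.g., a Likert-judge-style setup turn).
    if is_flagged(user_prompt):
        return "Request blocked by content filter."

    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name for the example
        messages=[{"role": "user", "content": user_prompt}],
    ).choices[0].message.content

    # Filter the outgoing reply as well: Bad Likert Judge elicits harmful
    # text in the *response*, so output-side filtering matters here.
    if is_flagged(reply):
        return "Response withheld by content filter."
    return reply
```

Output-side filtering is the important half of this sketch, since the individual turns of a Bad Likert Judge conversation can look benign while the elicited example text is not.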