Lightweight Safety Classification Using Pruned Language Models

2024-12-19

Researchers introduce Layer Enhanced Classification (LEC), a novel lightweight technique for content safety and prompt injection classification in Large Language Models (LLMs). LEC trains a streamlined Penalized Logistic Regression (PLR) classifier on the hidden state of an LLM's optimal intermediate transformer layer. Combining the efficiency of PLR with the sophisticated language understanding of LLMs, LEC outperforms GPT-4o and specialized models. Small general-purpose models like Qwen 2.5 and architectures such as DeBERTa v3 prove robust feature extractors, effectively training with fewer than 100 high-quality examples. Crucially, intermediate transformer layers often outperform the final layer. A single general-purpose LLM can classify content safety, detect prompt injections, and generate output, or smaller LLMs can be pruned to their optimal intermediate layer for feature extraction. Consistent results across architectures suggest robust feature extraction is inherent to many LLMs.