Subliminal Learning: A Hidden Danger in LLMs
2025-07-23
New research reveals a disturbing phenomenon in large language models (LLMs) called "subliminal learning": student models acquire traits from teacher models (e.g., a preference for owls, or misalignment) even when the training data appears unrelated to those traits. The effect persists under rigorous data filtering, and it arises only when teacher and student share the same base model. The implications for AI safety are significant: filtering bad behavior out of training data may not be enough to stop models from acquiring bad tendencies, so deeper safety evaluation methods are needed.
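The mechanism can be made concrete with a toy linear sketch (my own illustration under simplifying assumptions, not the paper's LLM experiment): a "teacher" differs from a shared base model by a hidden trait direction, and a student initialized from the same base is distilled on teacher outputs for generic, trait-unrelated inputs. Because those outputs encode the teacher's full parameter vector, the student absorbs the trait anyway.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w_base = rng.normal(size=d)            # shared base model weights
trait = rng.normal(size=d)             # hidden "trait" perturbation (hypothetical)
w_teacher = w_base + trait             # teacher = base + trait

X = rng.normal(size=(256, d))          # generic distillation inputs,
y = X @ w_teacher                      # labels that look trait-unrelated

w_student = w_base.copy()              # student starts from the same base
for _ in range(500):                   # plain gradient descent on MSE
    grad = X.T @ (X @ w_student - y) / len(X)
    w_student -= 0.3 * grad

# The student's weights now match the teacher's, trait direction included.
print(np.allclose(w_student, w_teacher, atol=1e-3))  # → True
```

In this linear case the result is unsurprising; what the paper shows is that an analogous transfer happens in nonlinear LLMs sharing an initialization, and that it survives content-level filtering of the training data.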
Read more