LLMs Fail to Generalize Beyond Training Data
2025-08-12

Researchers tested how well large language models (LLMs) generalize to tasks, formats, and input lengths outside their training data. Accuracy fell sharply as tasks diverged from the training distribution. Even when the models produced correct answers, their reasoning was often illogical or inconsistent with those answers, suggesting that chain-of-thought (CoT) reasoning reflects the replication of patterns learned during training rather than genuine text understanding. Performance also degraded markedly when the models were given inputs of unfamiliar lengths or symbols, further underscoring the limits of their generalization.
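
The kind of length-generalization probe described above can be illustrated with a minimal sketch: generate a simple symbolic task (here, multi-digit addition) at operand lengths inside and outside an assumed training range, query a model, and compare accuracy. The `query_model` stub and the chosen length ranges are hypothetical placeholders for illustration, not the researchers' actual harness.

```python
import random


def make_addition_prompt(num_digits: int) -> tuple[str, str]:
    """Build a synthetic addition problem with operands of a given length."""
    a = random.randint(10 ** (num_digits - 1), 10 ** num_digits - 1)
    b = random.randint(10 ** (num_digits - 1), 10 ** num_digits - 1)
    return f"What is {a} + {b}? Answer with the number only.", str(a + b)


def query_model(prompt: str) -> str:
    """Placeholder for an LLM call (local model or API).

    Hypothetical stub: replace with a real call; as written it returns
    an empty answer so the script still runs end to end.
    """
    return ""


def accuracy_at_length(num_digits: int, n_trials: int = 50) -> float:
    """Fraction of correct answers for problems of a given operand length."""
    correct = 0
    for _ in range(n_trials):
        prompt, gold = make_addition_prompt(num_digits)
        if query_model(prompt).strip() == gold:
            correct += 1
    return correct / n_trials


if __name__ == "__main__":
    # Assumed in-distribution lengths (2-4 digits) vs. longer,
    # out-of-distribution ones (8, 16 digits).
    for digits in (2, 3, 4, 8, 16):
        print(f"{digits}-digit operands: accuracy = {accuracy_at_length(digits):.2f}")
```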
AI