Large Reasoning Models: Collapse and Counterintuitive Scaling

Recent generations of Large Language Models (LLMs) have given rise to Large Reasoning Models (LRMs), which generate detailed reasoning traces before providing answers. While these models show improved performance on reasoning benchmarks, their fundamental capabilities remain poorly understood. This work investigates LRMs using controllable puzzle environments in which problem complexity can be varied systematically, and reveals a complete accuracy collapse beyond a certain complexity threshold. Surprisingly, reasoning effort increases with problem complexity up to a point and then declines, even though the available token budget is sufficient. When LRMs are compared to standard LLMs, three performance regimes emerge: (1) low-complexity tasks where standard LLMs outperform LRMs, (2) medium-complexity tasks where LRMs show an advantage, and (3) high-complexity tasks where both fail. LRMs also exhibit limitations in exact computation: they fail to use explicit algorithms and reason inconsistently. This study highlights the strengths and limitations of LRMs and raises crucial questions about their true reasoning capabilities.
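To make the idea of a "controllable puzzle environment" concrete, here is a minimal sketch assuming a Tower of Hanoi-style task, where complexity is scaled by a single parameter (the number of disks) and a model's answer can be verified exactly by simulation. The abstract does not name the specific puzzles or any released code; the helper names below (`hanoi_solution`, `is_valid_solution`) are hypothetical and purely illustrative.

```python
from typing import List, Tuple

def hanoi_solution(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Tuple[int, int]]:
    """Reference solution: the minimal move list (2^n - 1 moves) for n disks."""
    if n == 0:
        return []
    return (hanoi_solution(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_solution(n - 1, aux, src, dst))

def is_valid_solution(n: int, moves: List[Tuple[int, int]]) -> bool:
    """Simulate a proposed move sequence and check that it solves the puzzle."""
    pegs = [list(range(n, 0, -1)), [], []]   # peg 0 holds disks n..1, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                      # illegal: moving from an empty peg
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                      # illegal: larger disk on a smaller one
        pegs[dst].append(disk)
    return pegs[2] == list(range(n, 0, -1))   # solved: all disks on the target peg

# Sweep complexity: the minimal solution length grows as 2^n - 1, so each
# added disk roughly doubles the number of reasoning steps required.
for n in range(1, 8):
    moves = hanoi_solution(n)                 # in an experiment, this would be the model's output
    print(f"{n} disks: {len(moves)} moves, valid = {is_valid_solution(n, moves)}")
```

An evaluation in this style would replace the reference generator with model-produced move sequences and plot validity rates against the complexity parameter, which is how a collapse threshold of the kind described above could be located.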