Budget Reasoning Models Outperform Giants: Conquering Logic Puzzles with Reinforcement Learning

2025-03-06

Researchers used reinforcement learning to train smaller, cheaper open-source language models that surpassed DeepSeek R1 and OpenAI's o1 and o3-mini, and nearly matched Anthropic's Sonnet 3.7, on a reasoning-heavy game called "Temporal Clue," while being over 100x cheaper at inference time. They achieved this through careful task design, hyperparameter tuning, and the use of the Group Relative Policy Optimization (GRPO) algorithm with the torchtune library. The work demonstrates that reinforcement learning can efficiently train open models for complex deduction tasks even with limited data, yielding significant performance gains from as few as 16 training examples.
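The summary names GRPO as the training algorithm. As a rough illustration of its core idea, the sketch below shows a group-relative advantage step in plain PyTorch: sample several completions per puzzle, score each with a task reward (e.g. whether the Temporal Clue answer is correct), and normalize rewards within the group so no separate value network is needed. This is a minimal, hypothetical sketch, not the torchtune API or the authors' training code, and it omits the clipping and KL terms of the full method.

```python
# Hypothetical sketch of the group-relative advantage step behind GRPO.
# Not the torchtune API; function names and shapes are illustrative only.
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize rewards within each group of sampled completions.

    rewards: shape (num_prompts, group_size), one scalar reward per completion,
             e.g. 1.0 if the puzzle answer is correct, 0.0 otherwise.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)


def grpo_policy_loss(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Simplified policy-gradient loss weighting completion log-probs by advantage.

    logprobs: shape (num_prompts, group_size), summed token log-probabilities of
              each sampled completion under the current policy.
    """
    return -(advantages.detach() * logprobs).mean()


if __name__ == "__main__":
    # Toy example: 2 puzzles, 4 sampled completions each, binary correctness rewards.
    rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                            [0.0, 0.0, 1.0, 0.0]])
    fake_logprobs = torch.randn(2, 4, requires_grad=True)
    loss = grpo_policy_loss(fake_logprobs, group_relative_advantages(rewards))
    loss.backward()
    print(loss.item())
```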
