Unpacking R1-Zero: Efficient LLM Alignment with the Oat Framework

2025-03-22

Researchers have released a paper, models, and a codebase that examine how R1-Zero-like training works. They developed Oat, a highly modular and efficient reinforcement learning framework for LLMs, and used it to apply R1-Zero-style training to models such as Qwen2.5. The study found that two things are crucial: choosing a suitable base model and using an improved reinforcement learning algorithm, Dr. GRPO, which avoids the biased optimization caused by mismatched templates and question sets. With this recipe they report state-of-the-art performance after only 27 hours of compute on 8x A100 GPUs.
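To make the "improved algorithm" point concrete, here is a minimal sketch of how the group-relative advantage in Dr. GRPO is commonly described as differing from standard GRPO. It is not taken from the summary above or from the Oat codebase. The function names (`grpo_advantages`, `dr_grpo_advantages`, `response_objective`) are hypothetical, and the sketch assumes that Dr. GRPO drops the per-group standard-deviation scaling and the per-response length normalization.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: center by the group mean, scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantage (assumed form): center by the group mean only."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def response_objective(token_terms, advantage, length_normalize):
    """Aggregate per-token terms for one sampled response.

    length_normalize=True divides by the response length (GRPO-style);
    False keeps the plain sum (the assumed Dr. GRPO-style aggregation).
    """
    total = advantage * sum(token_terms)
    return total / len(token_terms) if length_normalize else total


# Toy group of four sampled answers to one question: 1.0 = correct, 0.0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))
print(dr_grpo_advantages(rewards))
```

In this toy setup, a wrong answer that runs long has its negative term divided by a larger length when `length_normalize=True`, so it is penalized less per token than a short wrong answer. That is one way to picture the kind of optimization bias the summary refers to.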

AI