Unpacking R1-Zero: Efficient LLM Alignment with the Oat Framework
2025-03-22
Researchers released a paper, models, and codebase examining how R1-Zero-like training works. They developed Oat, a highly modular and efficient LLM reinforcement learning framework, and used it to run R1-Zero-style training on models such as Qwen2.5. The study found that two things are crucial: a suitable base model, paired with a template and question set that match it, and an improved reinforcement learning algorithm (Dr. GRPO) that avoids biased optimization. Ultimately, they achieved state-of-the-art performance with only 27 hours of compute on 8x A100 GPUs.
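To make the "biased optimization" point concrete, here is a minimal sketch of how Dr. GRPO is usually described as differing from GRPO: it keeps the group-mean baseline but drops the division by the group's reward standard deviation and the per-response division by length. This is an illustration under those assumptions, not code from Oat; the function names, the `budget` constant, and the use of the population standard deviation are all hypothetical choices made for the example.

```python
from statistics import mean, pstdev


def group_advantages(rewards, divide_by_std):
    """Advantages for the responses sampled for one prompt (hypothetical helper)."""
    baseline = mean(rewards)
    centred = [r - baseline for r in rewards]
    if not divide_by_std:
        # Dr. GRPO-style (assumed): reward minus group mean, nothing else.
        return centred
    # GRPO-style: additionally rescale by the group's reward spread.
    scale = pstdev(rewards) + 1e-6
    return [c / scale for c in centred]


def token_weights(lengths, normalise_by_length, budget):
    """Weight given to each token of each response when summing the loss."""
    if normalise_by_length:
        # GRPO-style: 1/length, so tokens in long responses count for less.
        return [1.0 / n for n in lengths]
    # Dr. GRPO-style (assumed): one constant for every response.
    return [1.0 / budget for _ in lengths]


rewards = [1.0, 0.0, 0.0, 0.0]  # one correct answer out of four samples
lengths = [10, 100, 100, 100]   # response lengths in tokens

grpo_adv = group_advantages(rewards, divide_by_std=True)
dr_adv = group_advantages(rewards, divide_by_std=False)
grpo_w = token_weights(lengths, normalise_by_length=True, budget=100)
dr_w = token_weights(lengths, normalise_by_length=False, budget=100)
```

In this toy group, the length-normalised variant weights each token of the 10-token response ten times more heavily than each token of a 100-token response, while the constant-weight variant treats all tokens alike.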
AI