Lightweight GRPO Training: No Transformers, No vLLM

2025-04-13

This project implements a lightweight GRPO (Group Relative Policy Optimization) training framework, built almost from scratch and relying only on the tokenizers library and PyTorch. It departs from the original algorithm in two ways: the KL divergence term is removed, and overlong episodes are filtered out of the loss, which improves training stability and reduces GPU memory usage.

The framework trains the Qwen2.5-3B-Instruct model on the CountDown task, in which the model must produce a mathematical expression that reaches a target value using a given set of numbers. The model learns to generate chain-of-thought reasoning before the final answer, guided by a format reward and an answer reward. The whole pipeline is reproducible, running on a single A40 GPU with only a few commands.
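
To make the modified objective concrete, the sketch below shows one way to compute group-relative advantages and a clipped policy loss with no KL penalty, masking out episodes truncated at the generation limit (overlong filtering). The function names, tensor shapes, and the clipping constant are assumptions for illustration, not this project's actual API.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: (num_prompts, group_size) scalar reward per sampled completion.
    # Advantages are rewards normalized within each group, so only relative
    # quality inside a group matters.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def policy_loss(logprobs, old_logprobs, advantages, token_mask, truncated, clip=0.2):
    # logprobs, old_logprobs, token_mask: (num_prompts, group_size, seq_len).
    # truncated: (num_prompts, group_size) bool, True for episodes cut off at
    # the max generation length. Their tokens are dropped from the loss
    # entirely (overlong episode filtering), and there is no KL penalty term.
    token_mask = token_mask * (~truncated).unsqueeze(-1).float()
    ratio = torch.exp(logprobs - old_logprobs)      # per-token importance ratio
    adv = advantages.unsqueeze(-1)                  # broadcast over tokens
    surrogate = torch.minimum(ratio * adv, ratio.clamp(1 - clip, 1 + clip) * adv)
    return -(surrogate * token_mask).sum() / token_mask.sum().clamp(min=1.0)
```

Every token in an episode shares that episode's advantage, which is what lets GRPO dispense with a learned value function.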
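
The reward side can be sketched in the same spirit. The <think>/<answer> tags, the character whitelist, and the exact scoring below are assumptions; the repo's actual reward functions may be shaped differently.

```python
import re

def format_reward(completion: str) -> float:
    # Reward completions that follow the assumed
    # <think>...</think><answer>...</answer> layout.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def answer_reward(completion: str, numbers: list[int], target: int) -> float:
    # Extract the proposed expression, then check that it uses exactly the
    # given numbers and evaluates to the target.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    expr = match.group(1).strip()
    if not re.fullmatch(r"[\d+\-*/() ]+", expr):    # whitelist before eval
        return 0.0
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    if used != sorted(numbers):
        return 0.0
    try:
        value = eval(expr)                          # safe given the whitelist
    except Exception:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0
```

A combined episode reward can then weight the two signals, e.g. giving partial credit for well-formed output even when the arithmetic is wrong.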

Development