Lightweight GRPO Training: No Transformers, No vLLM

2025-04-13

This project implements a lightweight GRPO (Group Relative Policy Optimization) training framework, built almost from scratch and relying only on the tokenizers library and PyTorch. It departs from the original algorithm in two ways: the KL divergence term is removed, and overlong episodes are filtered out of the loss, which improves training stability and reduces GPU memory usage.

The framework trains the Qwen2.5-3B-Instruct model on the CountDown task, in which the model must produce a mathematical expression that reaches a target value using a given set of numbers. The model learns to generate chain-of-thought reasoning before the final answer, guided by a format reward and an answer reward. The whole pipeline is reproducible, running on a single A40 GPU with only a few commands.
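
To make the modified objective concrete, the sketch below shows one way to compute group-relative advantages and a clipped policy loss with no KL penalty, masking out episodes truncated at the generation limit (overlong filtering). The function names, tensor shapes, and the clipping constant are assumptions for illustration, not this project's actual API.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: (num_prompts, group_size) scalar reward per sampled completion.
    # Advantages are rewards normalized within each group, so only relative
    # quality inside a group matters.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def policy_loss(logprobs, old_logprobs, advantages, token_mask, truncated, clip=0.2):
    # logprobs, old_logprobs, token_mask: (num_prompts, group_size, seq_len).
    # truncated: (num_prompts, group_size) bool, True for episodes cut off at
    # the max generation length. Their tokens are dropped from the loss
    # entirely (overlong episode filtering), and there is no KL penalty term.
    token_mask = token_mask * (~truncated).unsqueeze(-1).float()
    ratio = torch.exp(logprobs - old_logprobs)      # per-token importance ratio
    adv = advantages.unsqueeze(-1)                  # broadcast over tokens
    surrogate = torch.minimum(ratio * adv, ratio.clamp(1 - clip, 1 + clip) * adv)
    return -(surrogate * token_mask).sum() / token_mask.sum().clamp(min=1.0)
```

Every token in an episode shares that episode's advantage, which is what lets GRPO dispense with a learned value function.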
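
The reward side can be sketched in the same spirit. The <think>/<answer> tags, the character whitelist, and the exact scoring below are assumptions; the repo's actual reward functions may be shaped differently.

```python
import re

def format_reward(completion: str) -> float:
    # Reward completions that follow the assumed
    # <think>...</think><answer>...</answer> layout.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def answer_reward(completion: str, numbers: list[int], target: int) -> float:
    # Extract the proposed expression, then check that it uses exactly the
    # given numbers and evaluates to the target.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    expr = match.group(1).strip()
    if not re.fullmatch(r"[\d+\-*/() ]+", expr):    # whitelist before eval
        return 0.0
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    if used != sorted(numbers):
        return 0.0
    try:
        value = eval(expr)                          # safe given the whitelist
    except Exception:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0
```

A combined episode reward can then weight the two signals, e.g. giving partial credit for well-formed output even when the arithmetic is wrong.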

Development