Training Long-Horizon Terminal Agents with Reinforcement Learning: Terminal-Bench-RL
This project details the creation of stable RL training infrastructure, scaling to 32x H100 GPUs across 4 nodes, for training long-horizon terminal-based coding agents. Built on the rLLM framework, it includes custom environments, tools, and training infrastructure. The resulting agent, Terminal-Agent-Qwen3-32b, achieved the highest terminal-bench score of any Qwen3 agent *without* training: guided by a carefully engineered system prompt and custom tools, it placed 19th on the terminal-bench leaderboard, outperforming several top agents from Stanford and OpenAI. A full training run, estimated at ~$1M in compute, was cost-prohibitive, so the code and dataset are provided to invite further research by anyone with the resources to run it.
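To make the "custom environment" idea concrete, below is a minimal sketch of what a terminal environment for such an agent might look like. This is a hypothetical illustration, not rLLM's actual API: the class name `TerminalEnv` and its `reset`/`step` interface are assumptions modeled on the common Gym-style loop.

```python
import subprocess
import tempfile
import shutil
from dataclasses import dataclass, field

@dataclass
class TerminalEnv:
    """Hypothetical minimal terminal environment (not rLLM's real API).

    Each episode gives the agent a fresh sandbox directory; the agent
    acts by emitting shell commands and observes their output.
    """
    max_steps: int = 20
    _cwd: str = field(default="", init=False)
    _steps: int = field(default=0, init=False)

    def reset(self) -> str:
        """Start a new episode in an empty temporary directory."""
        self._cwd = tempfile.mkdtemp(prefix="term_env_")
        self._steps = 0
        return f"shell ready in {self._cwd}"

    def step(self, command: str):
        """Run one shell command; return (observation, reward, done)."""
        self._steps += 1
        proc = subprocess.run(
            command, shell=True, cwd=self._cwd,
            capture_output=True, text=True, timeout=30,
        )
        # Truncate long output, as real long-horizon agents must manage context.
        observation = (proc.stdout + proc.stderr)[-2000:]
        done = self._steps >= self.max_steps
        # Sparse reward placeholder: a real setup would grade the episode
        # with task-specific tests at the end.
        reward = 0.0
        return observation, reward, done

    def close(self):
        shutil.rmtree(self._cwd, ignore_errors=True)
```

A rollout then alternates between the policy proposing a command and the environment returning terminal output, e.g. `env.step("echo hello > f.txt && cat f.txt")`, until the step budget is exhausted or the task's tests pass.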