Offline Reinforcement Learning Boosts Multi-Step Reasoning in LLMs
Researchers introduce OREO, an offline reinforcement learning method designed to improve the multi-step reasoning of large language models (LLMs). Building on maximum entropy reinforcement learning, OREO jointly learns a policy model and a value function by optimizing the soft Bellman equation. This design targets two limitations of Direct Preference Optimization (DPO) in multi-step reasoning: its reliance on extensive paired preference data and its difficulty with credit assignment across steps. In experiments, OREO outperforms existing offline learning methods on benchmarks for mathematical reasoning and embodied agent control.
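To make the idea of "optimizing the soft Bellman equation" concrete, here is a minimal sketch, not the authors' implementation. It assumes the common KL-regularized form of the soft Bellman consistency condition, V(s_t) - V(s_{t+1}) = r_t - beta * log(pi(a_t|s_t) / pi_ref(a_t|s_t)), and scores a single trajectory by how far it is from satisfying that condition. The function names, the reference-policy term, and the squared-error loss are illustrative assumptions; a real system would compute the values and log-probabilities with neural networks and minimize the loss over both the policy and the value function.

```python
from typing import List, Sequence


def soft_bellman_residuals(
    values: Sequence[float],       # V(s_0), ..., V(s_T); length T + 1
    rewards: Sequence[float],      # r_0, ..., r_{T-1}; length T
    logp_policy: Sequence[float],  # log pi(a_t | s_t) under the learned policy
    logp_ref: Sequence[float],     # log pi_ref(a_t | s_t) under a reference policy
    beta: float = 1.0,             # strength of the KL regularization (assumed)
) -> List[float]:
    """Per-step violation of the assumed soft Bellman consistency condition."""
    steps = len(rewards)
    if len(values) != steps + 1:
        raise ValueError("values must have one more entry than rewards")
    if len(logp_policy) != steps or len(logp_ref) != steps:
        raise ValueError("log-probabilities must have one entry per step")

    residuals = []
    for t in range(steps):
        log_ratio = logp_policy[t] - logp_ref[t]
        # Zero when V(s_t) - V(s_{t+1}) equals r_t - beta * log_ratio.
        residuals.append(values[t] - values[t + 1] - rewards[t] + beta * log_ratio)
    return residuals


def consistency_loss(residuals: Sequence[float]) -> float:
    """Mean squared residual over the trajectory (illustrative loss choice)."""
    return sum(r * r for r in residuals) / len(residuals)


if __name__ == "__main__":
    # Toy two-step trajectory with a sparse reward on the final step.
    res = soft_bellman_residuals(
        values=[0.5, 1.0, 0.0],
        rewards=[0.0, 1.0],
        logp_policy=[-1.0, -2.0],
        logp_ref=[-1.5, -2.0],
        beta=0.5,
    )
    print(res, consistency_loss(res))
```

Because each residual ties a step's value estimate to the next one, a reward that arrives only at the end of a reasoning chain can still shape the estimates for earlier steps, which is one way to picture the credit assignment the summary mentions. The loss also needs only single trajectories with rewards, not paired preferences.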