Offline Reinforcement Learning Boosts Multi-Step Reasoning in LLMs
2024-12-23
Researchers introduce OREO, an offline reinforcement learning method designed to improve the multi-step reasoning abilities of large language models (LLMs). Built on maximum entropy reinforcement learning, OREO jointly learns a policy model and a value function by optimizing the soft Bellman equation. This design addresses two limitations of Direct Preference Optimization (DPO) in multi-step settings: its reliance on large amounts of paired preference data and its difficulty with fine-grained credit assignment across reasoning steps. Experiments show that OREO outperforms existing offline learning methods on benchmarks spanning mathematical reasoning and embodied agent control.
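To make the soft Bellman idea concrete, below is a minimal sketch of a KL-regularized (maximum-entropy) soft Bellman consistency loss of the kind the summary describes: a value head is trained so that the policy's per-step log-probability ratio against a frozen reference model matches the reward plus the change in value along a reasoning trajectory. This is an illustration under stated assumptions, not the paper's implementation; the function name, the exact residual form, and the `beta` hyperparameter are hypothetical.

```python
# Illustrative sketch (assumption, not the paper's code): a KL-regularized
# soft Bellman consistency loss for a single reasoning trajectory.
import torch


def soft_bellman_loss(
    policy_logps: torch.Tensor,   # log pi(a_t | s_t) per step, shape (T,)
    ref_logps: torch.Tensor,      # log pi_ref(a_t | s_t) per step, shape (T,)
    values: torch.Tensor,         # V(s_t) from a value head, shape (T + 1,)
    reward: torch.Tensor,         # scalar outcome reward for the trajectory
    beta: float = 0.1,            # KL-regularization strength (hypothetical default)
) -> torch.Tensor:
    """Squared residual of a soft Bellman consistency condition.

    In KL-regularized (maximum-entropy) RL, optimality roughly requires
        beta * log(pi / pi_ref) = r_t + V(s_{t+1}) - V(s_t),
    so driving this residual toward zero jointly shapes the policy and the
    value function. Here the sparse outcome reward sits on the final step.
    """
    T = policy_logps.shape[0]
    # Per-step rewards: zero everywhere except the last step (sparse outcome reward).
    rewards = torch.zeros(T, dtype=values.dtype)
    rewards[-1] = reward

    implicit_reward = beta * (policy_logps - ref_logps)   # beta * log(pi / pi_ref)
    target = rewards + values[1:] - values[:-1]           # r_t + V(s_{t+1}) - V(s_t)
    return ((implicit_reward - target) ** 2).mean()


# Toy usage with a random trajectory of 5 reasoning steps.
if __name__ == "__main__":
    torch.manual_seed(0)
    loss = soft_bellman_loss(
        policy_logps=torch.randn(5),
        ref_logps=torch.randn(5),
        values=torch.randn(6),
        reward=torch.tensor(1.0),  # e.g. 1.0 if the final answer is correct
    )
    print(f"soft Bellman consistency loss: {loss.item():.4f}")
```

Because the residual is evaluated at every step, a loss of this shape lets the value function attribute credit to individual reasoning steps rather than only to whole responses, which is where the summary locates DPO's credit-assignment weakness.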