RLVR Boosts Reasoning... But at What Cost?

2025-04-22

Experiments across math, coding, and visual reasoning domains compared base large language models with their RLVR-trained counterparts (RLVR: Reinforcement Learning with Verifiable Rewards). RLVR improved accuracy at low values of k but reduced problem coverage at higher values of k. This suggests that RLVR sharpens accuracy when only a few attempts are allowed while narrowing the diversity of exploration. Base models retained broader reasoning coverage, even though the RLVR-trained models were more accurate at low k. Because the findings are consistent across domains, they indicate that RLVR enhances reasoning without fundamentally altering the model's problem-solving approach.
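
The trade-off between low-k accuracy and high-k coverage can be made concrete with a small sketch. It assumes that the k-values refer to the pass@k metric (a problem counts as solved if at least one of k sampled answers is correct) and uses the standard unbiased estimator 1 - C(n-c, k) / C(n, k), where n answers are sampled per problem and c of them are correct. The function names and the per-problem counts below are hypothetical and chosen only for illustration; they are not results from the experiments.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of answers sampled for the problem
    c: number of those answers that were correct
    k: number of attempts allowed (must not exceed n)
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    # 1 minus the probability that a random k-subset is entirely incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct-answer counts out of n = 100 samples on four problems.
n = 100
base_counts = [10, 10, 10, 10]  # every problem solved occasionally
rlvr_counts = [80, 80, 0, 0]    # two problems solved reliably, two never

for k in (1, 10, 100):
    base = mean_pass_at_k(base_counts, n, k)
    rlvr = mean_pass_at_k(rlvr_counts, n, k)
    print(f"k={k:>3}  base={base:.3f}  rlvr={rlvr:.3f}")
```

With these made-up counts, the RLVR-style model is ahead at k = 1 because it answers two problems reliably, while the base-style model is ahead at k = 100 because it occasionally solves every problem. That is the same qualitative pattern the summary describes: higher accuracy at low k, lower coverage at high k.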