The Scalability Challenge of Reinforcement Learning: Can Q-Learning Handle Long Horizons?

2025-06-15

Many machine learning objectives, such as next-token prediction, denoising diffusion, and contrastive learning, have proven scalable in recent years. Reinforcement learning (RL), and off-policy RL based on Q-learning in particular, still faces challenges in scaling to complex, long-horizon problems. This article argues that existing Q-learning algorithms struggle with problems requiring more than 100 semantic decision steps because bias accumulates in their prediction targets. Experiments show that, even with abundant data and other factors controlled, standard off-policy RL algorithms fail to solve complex tasks. Horizon reduction, however, significantly improves scalability, which suggests a need for better algorithms that address the horizon problem directly rather than relying solely on more data and compute.
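To make the horizon argument concrete, here is a minimal sketch that contrasts a 1-step Q-learning target with an n-step target. It assumes n-step returns as one common form of horizon reduction; the summary above does not say which techniques the article evaluates. The function names, and the idea of counting bootstrap steps as a rough proxy for how often target bias can compound, are illustrative assumptions.

```python
# Illustrative sketch only: n-step returns are assumed here as one common
# form of horizon reduction. All names below are hypothetical.

def one_step_target(reward, next_value, gamma):
    """Standard 1-step TD target: r + gamma * V(s')."""
    return reward + gamma * next_value


def n_step_target(rewards, bootstrap_value, gamma):
    """n-step TD target: discounted sum of n observed rewards,
    plus a single bootstrapped value estimate at the end."""
    target = bootstrap_value
    # Fold the rewards in from last to first so each is discounted correctly.
    for r in reversed(rewards):
        target = r + gamma * target
    return target


def bootstrap_count(horizon, n):
    """Number of bootstrapped backups needed to carry value information
    across `horizon` decision steps when each backup spans n steps
    (ceiling of horizon / n)."""
    return -(-horizon // n)


if __name__ == "__main__":
    # With a 100-step task, 1-step targets chain 100 bootstrapped estimates,
    # while 10-step targets chain only 10.
    print(bootstrap_count(100, 1), bootstrap_count(100, 10))
```

Each bootstrapped backup reuses a learned, possibly biased estimate, so fewer backups mean fewer opportunities for that bias to compound. In this sketch that is the sense in which shortening the effective horizon can help.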
