Fine-tuning LLMs Without Reinforcement Learning: Introducing Direct Preference Optimization (DPO)

2025-05-28

The Together platform now supports Direct Preference Optimization (DPO), a technique for aligning language models with human preferences without reinforcement learning. DPO trains models directly on preference data—prompts, preferred responses, and non-preferred responses—resulting in more helpful, accurate, and tailored AI assistants. Compared to traditional reinforcement learning methods, DPO is simpler, more efficient, and easier to implement. This post explains how DPO works, how to use it on the Together platform, and walks through code examples, recommending a two-stage process: supervised fine-tuning (SFT) followed by DPO refinement.
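
To make the shape of preference data concrete, here is a minimal sketch of a pairwise preference record written to JSONL with Python. The field names (`prompt`, `preferred_response`, `non_preferred_response`) are illustrative assumptions for this sketch, not necessarily the exact schema the Together fine-tuning API expects; the usage sections later in this post cover the supported format.

```python
import json

# Illustrative pairwise preference records: each example pairs a prompt with a
# preferred (chosen) response and a non-preferred (rejected) response.
# Field names are assumptions for illustration, not the platform's schema.
preference_data = [
    {
        "prompt": "Explain what a hash table is in one sentence.",
        "preferred_response": (
            "A hash table maps keys to values by using a hash function "
            "to compute an index into an array of buckets."
        ),
        "non_preferred_response": "It's a kind of table.",
    },
]

# Write the records to a JSONL file, one JSON object per line.
with open("preference_data.jsonl", "w") as f:
    for record in preference_data:
        f.write(json.dumps(record) + "\n")
```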