Fine-tuning LLMs Without Reinforcement Learning: Introducing Direct Preference Optimization (DPO)

2025-05-28

The Together platform now supports Direct Preference Optimization (DPO), a technique for aligning language models with human preferences without reinforcement learning. DPO trains models directly on preference data (prompts, preferred responses, and non-preferred responses), resulting in more helpful, accurate, and tailored AI assistants. Compared to traditional reinforcement learning methods, DPO is simpler, more efficient, and easier to implement. The post explains how DPO works, how to use it, and walks through code examples, recommending a two-stage process: supervised fine-tuning (SFT) followed by DPO refinement.
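For intuition, here is a minimal sketch of the standard DPO objective in PyTorch. It assumes the per-sequence log-probabilities of the chosen and rejected responses have already been computed under both the policy being trained and a frozen reference model; the function name, argument names, and the beta value are illustrative, not the platform's implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities for the
    chosen or rejected response under the policy or the frozen reference.
    """
    # Implicit "reward" of each response: log-ratio of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected rewards.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()
```

Minimizing this loss pushes the policy to assign relatively more probability to preferred responses than the reference model does, without ever training an explicit reward model or running an RL loop.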


DeepCoder-14B: Open-Source Code Reasoning Model Matches OpenAI's o3-mini

2025-04-09

Agentica and Together AI have released DeepCoder-14B-Preview, a code reasoning model fine-tuned with distributed RL from DeepSeek-R1-Distill-Qwen-14B. It achieves 60.6% Pass@1 accuracy on LiveCodeBench, rivaling OpenAI's o3-mini while using only 14B parameters. The project open-sources its dataset, code, training logs, and system optimizations, showcasing a training recipe built on high-quality data and algorithmic improvements to GRPO. This release democratizes access to high-performing code-generation models.
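Since the weights are open, one quick way to try the model is through Hugging Face Transformers. The sketch below is an assumption-laden example: the repository id and generation settings are illustrative, so check the official release for the exact names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository id; verify against the official release.
MODEL_ID = "agentica-org/DeepCoder-14B-Preview"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user",
     "content": "Write a Python function that returns the nth Fibonacci number."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```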
