DeepSeek-V3: A 671B-Parameter Open-Source Mixture-of-Experts Language Model
2024-12-26
DeepSeek-V3 is a 671-billion-parameter Mixture-of-Experts (MoE) language model that activates 37 billion parameters per token. Building on Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture, it pioneers an auxiliary-loss-free load-balancing strategy (sketched below) and a multi-token prediction training objective. Pre-trained on 14.8 trillion high-quality tokens and then refined with supervised fine-tuning and reinforcement learning, DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models, with remarkable training efficiency of only 2.788M H800 GPU hours.
AI
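
To make the auxiliary-loss-free load-balancing idea concrete, here is a minimal NumPy sketch of the general mechanism: a per-expert bias is added to the routing scores only when selecting the top-k experts, while the gating weights that scale expert outputs still come from the unbiased scores, and the biases are nudged between steps according to observed expert load. The expert count, top-k value, batch size, and the update speed `gamma` below are illustrative assumptions, not the actual DeepSeek-V3 configuration.

```python
# Sketch of bias-based, auxiliary-loss-free load balancing for MoE routing.
# All hyperparameters here are toy values chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)

n_experts, top_k = 8, 2      # toy routing setup; V3 uses far more routed experts
gamma = 0.001                # bias update speed (assumed hyperparameter)
bias = np.zeros(n_experts)   # per-expert bias, adjusted between training steps

def route(scores, bias, top_k):
    """Select top-k experts by (score + bias); gate with the raw scores only."""
    biased = scores + bias
    chosen = np.argpartition(-biased, top_k, axis=-1)[..., :top_k]
    gate = np.take_along_axis(scores, chosen, axis=-1)
    gate = gate / gate.sum(axis=-1, keepdims=True)   # normalize gating weights
    return chosen, gate

for step in range(100):
    # token-to-expert affinity scores squashed to (0, 1) with a sigmoid
    scores = 1.0 / (1.0 + np.exp(-rng.normal(size=(1024, n_experts))))
    chosen, gate = route(scores, bias, top_k)

    # Measure per-expert load and nudge the biases: overloaded experts get a
    # lower bias (less likely to be selected next step), underloaded experts
    # get a higher one, steering the router toward balance without an
    # auxiliary loss term in the training objective.
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    bias -= gamma * np.sign(load - load.mean())
```

Because the bias enters only the selection step and never the gating weights, balancing pressure is applied without adding a load-balancing term to the loss, which is the motivation the paper gives for this strategy.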