LLM Architecture Evolution in 2025: Deep Dives into DeepSeek, OLMo, Gemma, Mistral, and Qwen

This article reviews the architectural advancements in large language models (LLMs) during 2025, focusing on open-source models such as DeepSeek, OLMo, Gemma, Mistral, and Qwen. DeepSeek V3/R1 improves compute and memory efficiency with Multi-Head Latent Attention (MLA) and a Mixture-of-Experts (MoE) feed-forward design. OLMo 2 revisits RMSNorm placement, moving normalization after the attention and feed-forward blocks (a Post-Norm-style arrangement) and adding QK-Norm. Gemma 3 relies on sliding window attention to cut KV-cache memory. Mistral Small 3.1 aims to balance benchmark performance with inference speed. Qwen 3 ships both dense and MoE variants, giving flexibility across deployment budgets. SmolLM3 stands out for its compact 3B parameter size and its use of NoPE (No Positional Embeddings). Finally, Kimi K2 impresses with its trillion-parameter scale and training with the Muon optimizer. Together, these models showcase innovations in attention mechanisms, normalization placement, MoE design, and optimizers, underscoring the diversity and ongoing evolution of LLM architectures. A few of these mechanisms are sketched in simplified form below.
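
To illustrate the KV-compression idea behind MLA, here is a minimal PyTorch sketch. The class name and dimensions are my own, and the real DeepSeek layer also compresses queries and keeps a separate decoupled RoPE path, both omitted here; the point is that only the small latent tensor would need to be cached at inference time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadLatentAttention(nn.Module):
    """Simplified MLA: keys and values are reconstructed from a small cached latent."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_latent: int = 64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # compress: this is what gets cached
        self.k_up = nn.Linear(d_latent, d_model, bias=False)     # expand latent back to keys
        self.v_up = nn.Linear(d_latent, d_model, bias=False)     # expand latent back to values
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        latent = self.kv_down(x)  # (b, t, d_latent), much smaller than the full per-head KV
        q, k, v = self.q_proj(x), self.k_up(latent), self.v_up(latent)

        def split(z: torch.Tensor) -> torch.Tensor:
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        y = F.scaled_dot_product_attention(split(q), split(k), split(v), is_causal=True)
        return self.out_proj(y.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 16, 512)
print(MultiHeadLatentAttention()(x).shape)  # torch.Size([2, 16, 512])
```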
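
The MoE side can be shown with a bare-bones top-k router. This is a hypothetical sketch (names and sizes are mine) that runs every expert densely for clarity; real implementations gather tokens per expert, add shared experts, and include load-balancing terms.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Toy top-k Mixture-of-Experts feed-forward layer (illustrative only)."""
    def __init__(self, d_model: int = 512, d_hidden: int = 1024,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, seq, d_model)
        scores = self.router(x)                             # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)      # pick the k best experts per token
        weights = weights.softmax(dim=-1)                   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            w = (weights * (idx == e)).sum(dim=-1, keepdim=True)  # 0 where expert e was not chosen
            out = out + w * expert(x)  # dense for clarity; real kernels only process routed tokens
        return out

x = torch.randn(2, 16, 512)
print(MoEFeedForward()(x).shape)  # torch.Size([2, 16, 512])
```

Because only `top_k` of the `n_experts` networks contribute to each token, the active parameter count per token stays a small fraction of the total.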
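
QK-Norm amounts to normalizing queries and keys per head before the dot product, which helps keep attention logits well behaved during training. A minimal sketch, assuming PyTorch >= 2.4 for `nn.RMSNorm` (class name and sizes are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Causal self-attention with RMSNorm applied to queries and keys per head."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.q_norm = nn.RMSNorm(self.d_head)   # requires PyTorch >= 2.4
        self.k_norm = nn.RMSNorm(self.d_head)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.n_heads, self.d_head)
        q = self.q_norm(q.view(shape)).transpose(1, 2)   # normalize each head's queries
        k = self.k_norm(k.view(shape)).transpose(1, 2)   # normalize each head's keys
        v = v.view(shape).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(y.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 16, 512)
print(QKNormAttention()(x).shape)  # torch.Size([2, 16, 512])
```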
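
Sliding window attention restricts each query to a local band of recent tokens, so only that window's keys and values need to stay in the cache for the affected layers; Gemma 3 interleaves such local layers with occasional global-attention layers. A small standalone sketch of the mask (the window size here is arbitrary):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: True where a query position may attend to a key position."""
    q_idx = torch.arange(seq_len).unsqueeze(1)   # (seq_len, 1) query positions
    k_idx = torch.arange(seq_len).unsqueeze(0)   # (1, seq_len) key positions
    causal = k_idx <= q_idx                      # no attending to future tokens
    local = (q_idx - k_idx) < window             # only the last `window` tokens
    return causal & local

# Each row shows which earlier tokens that position can attend to.
print(sliding_window_mask(8, 4).int())
```

A mask like this can be passed as the `attn_mask` argument to `F.scaled_dot_product_attention`, where `True` marks positions that are allowed to participate.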