OpenAI Unleashes gpt-oss: Powerful, Locally-Runnable Open-Weight LLMs

2025-08-10

OpenAI this week released gpt-oss-120b and gpt-oss-20b, its first open-weight models since GPT-2 in 2019. Thanks to clever optimizations, both models can run locally. This article examines the gpt-oss architecture, comparing it to models such as GPT-2 and Qwen3, and highlights design choices including Mixture-of-Experts (MoE), Grouped Query Attention (GQA), and sliding window attention. While benchmarks show gpt-oss performing on par with closed-source models in some areas, its local runnability and open weights make it a valuable asset for research and applications.
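To make the MoE idea concrete, here is a minimal NumPy sketch of top-k expert routing, the mechanism by which an MoE layer activates only a few expert networks per token. This is an illustrative toy (the function names, shapes, and the choice of top_k=2 are assumptions for the example), not gpt-oss's actual router implementation.

```python
import numpy as np

def moe_route(x, gate_w, top_k=2):
    """Toy MoE router: pick the top_k experts per token.

    x:      (tokens, dim) token activations
    gate_w: (dim, n_experts) gating weights
    Returns per-token expert indices and softmax-renormalized weights.
    """
    logits = x @ gate_w                              # (tokens, n_experts)
    idx = np.argsort(logits, axis=-1)[:, -top_k:]    # ids of top_k experts
    top = np.take_along_axis(logits, idx, axis=-1)   # their logits
    w = np.exp(top - top.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # weights sum to 1
    return idx, w

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 tokens, hidden dim 8
gate_w = rng.normal(size=(8, 16))  # gate over 16 experts
idx, w = moe_route(x, gate_w, top_k=2)
print(idx.shape, w.shape)  # (4, 2) (4, 2)
```

The point of the sketch: each token's forward pass touches only 2 of the 16 experts, which is why a model with a huge total parameter count can still be cheap enough to run locally.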

Read more

LLM Architecture Evolution in 2025: Deep Dives into DeepSeek, OLMo, Gemma, Mistral, and Qwen

2025-07-20

This article reviews the architectural advancements in large language models (LLMs) during 2025, focusing on open-source models like DeepSeek, OLMo, Gemma, Mistral, and Qwen. DeepSeek V3/R1 enhances computational efficiency with Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE). OLMo 2 emphasizes RMSNorm placement, employing Post-Norm and QK-Norm. Gemma 3 utilizes sliding window attention to reduce memory requirements. Mistral Small 3.1 balances performance and speed. Qwen 3 offers both dense and MoE variants for flexibility. SmolLM3 stands out with its 3B parameter size and NoPE (No Positional Embeddings). Finally, Kimi K2 impresses with its trillion-parameter scale and the Muon optimizer. These models showcase innovations in attention mechanisms, normalization, MoE, and optimizers, demonstrating the diversity and ongoing evolution of LLM architectures.
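Since RMSNorm placement (Post-Norm, QK-Norm) comes up repeatedly above, here is a minimal NumPy sketch of RMSNorm itself: unlike LayerNorm, it skips mean subtraction and the bias term, normalizing only by the root-mean-square of the features. The function name and the example values are assumptions for illustration.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root-mean-square over the last axis,
    then apply a learned per-feature gain. No mean subtraction, no bias."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

x = np.array([[1.0, 2.0, 3.0, 4.0]])
gain = np.ones(4)          # learned scale, initialized to 1
y = rms_norm(x, gain)
# after normalization the RMS of y is ~1
```

QK-Norm, as used by OLMo 2, applies this same operation to the query and key vectors inside each attention head before the dot product, which helps stabilize training.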

Read more

Four Approaches to Building Reasoning Models for LLMs

2025-02-06

This article explores four main approaches to enhancing Large Language Models (LLMs) with reasoning capabilities: inference-time scaling, pure reinforcement learning, supervised fine-tuning plus reinforcement learning, and model distillation. The development of DeepSeek R1 is used as a case study, showcasing how these methods can build powerful reasoning models, and how even budget-constrained researchers can achieve impressive results through distillation. The article also compares DeepSeek R1 to OpenAI's o1 and discusses strategies for building cost-effective reasoning models.
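Of the four approaches, inference-time scaling is the easiest to illustrate in code. One common form is self-consistency: sample several chain-of-thought completions for the same prompt and keep the majority-vote answer. The sketch below assumes the final answers have already been extracted from the sampled completions; the function name and sample values are illustrative only.

```python
from collections import Counter

def self_consistency(answers):
    """Inference-time scaling via self-consistency:
    given final answers extracted from N sampled completions,
    return the most frequent one (majority vote)."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# e.g. answers extracted from 5 sampled solutions to one math problem
samples = ["42", "42", "7", "42", "13"]
print(self_consistency(samples))  # 42
```

The trade-off is cost: accuracy improves with more samples, but inference compute grows linearly with N, which is why the article contrasts this approach with training-time methods like reinforcement learning and distillation.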

Read more