tiny-llm: LLM Serving in a Week – A Hands-on Tutorial

2025-04-28

tiny-llm is a tutorial that guides you through building an LLM serving infrastructure in a week. It relies only on MLX's array/matrix APIs, deliberately avoiding high-level neural network APIs, so you build everything from scratch and understand the optimizations involved. The tutorial covers core concepts such as attention mechanisms, RoPE (rotary positional embeddings), and grouped query attention, then progresses to model loading and response generation. At present, the chapters on attention, RoPE, and model loading are complete. Future chapters will cover KV caching, quantized matrix multiplication, Flash Attention, and other optimizations, with the goal of serving models like Qwen2 efficiently.
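To give a taste of the array-level approach, here is a minimal sketch of scaled dot-product attention written directly against MLX's `mlx.core` ops, in the spirit of the tutorial's first chapter. The function name and shapes are illustrative assumptions, not taken from the tiny-llm codebase.

```python
import mlx.core as mx

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention from raw array ops: softmax(q @ k^T / sqrt(d)) @ v."""
    scale = q.shape[-1] ** -0.5                    # 1 / sqrt(head_dim)
    scores = (q * scale) @ mx.swapaxes(k, -1, -2)  # (..., seq_len, seq_len)
    if mask is not None:
        scores = scores + mask                     # additive mask, e.g. -inf on future tokens
    return mx.softmax(scores, axis=-1) @ v

# Toy usage: a single head, 4 tokens, head_dim 8.
q = mx.random.normal((4, 8))
k = mx.random.normal((4, 8))
v = mx.random.normal((4, 8))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Building on primitives like this, rather than a prepackaged attention layer, is what later makes optimizations such as KV caching and Flash Attention tractable to implement and measure.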

Tags: Development, Model Serving