Fast LLM Inference Engine Built From Scratch
This article details the author's journey in building an LLM inference engine from scratch using C++ and CUDA, without relying on any libraries. The process provided a deep dive into the full stack of LLM inference, from CUDA kernels to model architecture, and shows how individual optimizations affect inference speed. The goal was a program that loads weights from common open-source models and performs single-batch inference on a single CPU+GPU server, iteratively improving token throughput until it surpasses llama.cpp. The article outlines the optimization steps on both CPU and GPU, including multithreading, weight quantization, SIMD, kernel fusion, and KV cache quantization, while analyzing the bottlenecks and challenges along the way. The final result achieves near state-of-the-art performance for local LLM inference.
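To make the summary concrete, here is a minimal, hypothetical sketch (not taken from the article's code) of two of the techniques it names, weight quantization and kernel fusion: a CUDA matrix-vector kernel that dequantizes int8 weights on the fly inside the dot product using per-row scales. All names, sizes, and launch parameters below are illustrative assumptions.

```cpp
// Illustrative sketch only -- not the article's implementation.
// Fused int8-dequantize + matrix-vector multiply: one block per output row,
// per-row scales, warp-shuffle reduction. Build with: nvcc -O2 matvec_int8.cu
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void matvec_int8(const int8_t* __restrict__ w_q,   // [rows * cols] quantized weights
                            const float* __restrict__ scale,  // [rows] per-row dequant scales
                            const float* __restrict__ x,      // [cols] input activations
                            float* __restrict__ y,            // [rows] output
                            int cols) {
    int row = blockIdx.x;
    float acc = 0.0f;
    // Each thread strides over the row; dequantization is fused into the dot product,
    // so no fp32 copy of the weights is ever materialized.
    for (int j = threadIdx.x; j < cols; j += blockDim.x)
        acc += (float)w_q[row * cols + j] * x[j];
    // Warp-level tree reduction, then one pass over the per-warp partial sums.
    for (int off = warpSize / 2; off > 0; off >>= 1)
        acc += __shfl_down_sync(0xffffffffu, acc, off);
    __shared__ float warp_sums[32];
    int lane = threadIdx.x % warpSize, warp = threadIdx.x / warpSize;
    if (lane == 0) warp_sums[warp] = acc;
    __syncthreads();
    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int w = 0; w < blockDim.x / warpSize; ++w) total += warp_sums[w];
        y[row] = scale[row] * total;  // apply the per-row scale once per output element
    }
}

int main() {
    const int rows = 4, cols = 256;                 // toy problem size
    std::vector<int8_t> w_q(rows * cols, 1);        // all quantized values = 1
    std::vector<float> scale(rows, 0.5f), x(cols, 2.0f), y(rows);

    int8_t* d_w; float *d_s, *d_x, *d_y;
    cudaMalloc(&d_w, w_q.size());
    cudaMalloc(&d_s, rows * sizeof(float));
    cudaMalloc(&d_x, cols * sizeof(float));
    cudaMalloc(&d_y, rows * sizeof(float));
    cudaMemcpy(d_w, w_q.data(), w_q.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(d_s, scale.data(), rows * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, x.data(), cols * sizeof(float), cudaMemcpyHostToDevice);

    matvec_int8<<<rows, 128>>>(d_w, d_s, d_x, d_y, cols);  // one block per output row
    cudaMemcpy(y.data(), d_y, rows * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %.1f (expected %.1f)\n", y[0], 0.5f * cols * 2.0f);

    cudaFree(d_w); cudaFree(d_s); cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```

Single-batch decoding is dominated by exactly this kind of memory-bound matrix-vector product, which is why quantized weights and fused kernels matter for token throughput: they reduce the bytes read per token rather than the arithmetic.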