Baseten Achieves SOTA Performance on GPT-OSS-120B: A Race Against Time

2025-08-07

As a launch partner for OpenAI's new open-source LLM, Baseten raced to optimize GPT-OSS-120B for peak performance on launch day. The team leveraged its flexible inference stack, testing TensorRT-LLM, vLLM, and SGLang on both Hopper and Blackwell GPU architectures. Key optimizations included KV cache-aware routing to maximize cache reuse across requests. Prioritizing low latency, the team chose tensor parallelism and the TensorRT-LLM MoE backend. They rapidly addressed compatibility issues, continuously refined the model configuration, and contributed fixes back to the open-source community. Future improvements include speculative decoding with Eagle for even faster inference.
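Although the final configuration used TensorRT-LLM, a minimal sketch of tensor-parallel serving with vLLM (one of the frameworks tested) might look like the following. This is an illustrative configuration only, not Baseten's production setup; the GPU count is an assumption:

```shell
# Illustrative sketch: serve GPT-OSS-120B with vLLM, sharding the model
# across 4 GPUs via tensor parallelism (lower per-token latency than
# pipeline parallelism, at the cost of inter-GPU communication).
# The GPU count and flags are assumptions, not Baseten's actual config.
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 4
```

Tensor parallelism splits each layer's weights across GPUs so every token is computed cooperatively, which favors latency; this matches the latency-first choice described above.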
