OpenAI Outage: Unexpected Load from New Telemetry Service Causes Major Disruption
2024-12-16
OpenAI experienced a major service disruption on December 11th, stemming from a newly deployed telemetry service. Intended to improve reliability, this service unexpectedly generated massive Kubernetes API server load, saturating the servers and causing the Kubernetes control plane to fail in most large clusters. This led to the breakdown of DNS-based service discovery. The incident highlights the unpredictable interactions within complex systems and the challenges of testing for failure modes that only appear under full load. OpenAI restored service by scaling down clusters, blocking network access to Kubernetes admin APIs, and scaling up API servers.