Google Cloud's Massive API Outage: A Null Pointer Exception's Ripple Effect
On June 12th, Google Cloud and Google Workspace products suffered a widespread outage due to a surge of 503 errors in external API requests. The root cause was a new feature in the Service Control system lacking proper error handling and feature flag protection, leading to a null pointer exception that triggered a cascading failure. A policy change containing invalid fields activated this flaw, resulting in a global service disruption. Google swiftly mitigated the issue, but some regions (like us-central-1) experienced prolonged recovery due to infrastructure overload. The incident highlighted issues in Google's error handling, feature flag usage, system architecture modularity, and monitoring and communication, prompting a commitment to implement comprehensive improvements to prevent recurrence.
Read more