
Last week, the global internet landscape experienced a wave of instability—services like Cloudflare, Spotify, Discord, Snapchat, Twitch, and many others began to falter en masse. At the heart of the disruption lay Google Cloud’s infrastructure: on June 12, clients across the world were abruptly cut off from their rented cloud resources for nearly three hours. The incident drew significant attention not only for its scale but also for the cascading nature of its impact—originating within Google and rippling outward to dependent services.
Google has now released an official technical postmortem. As with several major outages in recent years, the root cause was not a malicious intrusion or external cyberattack, but rather a flaw in the system’s quota management code. The failure was particularly insidious: it formed a self-reinforcing loop, the effects of which reverberated through multiple layers of infrastructure, including communications and monitoring systems.
It all began with what seemed like a minor update. On May 29, the Service Control component—a critical module that enforces quota and API policy compliance within Google Cloud—was enhanced with logic to support additional resource controls. This component plays a pivotal role in the platform’s architecture, processing requests via distributed regional instances and ensuring adherence to quotas, authorization, and access policies.
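To make that role concrete, here is a rough sketch in Go of what a quota-and-policy admission check of this kind looks like in the abstract. Every name in it (Policy, CheckRequest, admit) is invented for illustration; it depicts the class of system, not Google's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Policy describes the per-project limits this sketch assumes.
type Policy struct {
	Project     string
	API         string
	QuotaLimit  int64
	AllowedRole string
}

// CheckRequest carries the metadata an incoming API call would present.
type CheckRequest struct {
	Project string
	API     string
	Role    string
	Usage   int64 // current consumption counted against the quota
}

// admit enforces authorization and quota before a request is allowed through.
func admit(req CheckRequest, p *Policy) error {
	if p == nil {
		return errors.New("no policy found for project")
	}
	if req.Role != p.AllowedRole {
		return fmt.Errorf("role %q is not permitted to call %s", req.Role, p.API)
	}
	if req.Usage >= p.QuotaLimit {
		return fmt.Errorf("quota exceeded for %s: %d of %d", p.API, req.Usage, p.QuotaLimit)
	}
	return nil // the request may proceed to the backing service
}

func main() {
	p := &Policy{Project: "demo", API: "storage.objects.get", QuotaLimit: 1000, AllowedRole: "reader"}
	req := CheckRequest{Project: "demo", API: "storage.objects.get", Role: "reader", Usage: 10}
	fmt.Println("admit:", admit(req, p)) // prints "admit: <nil>" for a request within limits
}
```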
The update followed Google’s standard staged rollout across regions, and at first glance, all appeared to function smoothly. However, there was a subtle caveat: the altered code would only activate under very specific conditions—namely, the introduction of a new policy containing a particular data structure. During testing, such a policy had not been deployed. As a result, the flawed segment remained dormant—hidden in plain sight—awaiting a trigger.
That trigger came on June 12, when a policy was updated with fields containing empty values. This activated the previously untouched execution path. The new code attempted to access non-existent data, resulting in a null pointer error. The Service Control binary crashed instantly and entered a reboot loop. The issue replicated simultaneously across all regions, as each read the same policies and executed the same chain of operations.
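For readers who want to see the failure mode rather than just read about it, here is a toy reproduction in Go: code that assumes an optional policy field is always populated dereferences a nil pointer the moment that field arrives blank. The types are invented; only the crash pattern mirrors what the postmortem describes.

```go
package main

import "fmt"

// QuotaSpec and Policy are invented types; Quota is optional and may be nil
// when a policy arrives with blank fields.
type QuotaSpec struct {
	Limit int64
}

type Policy struct {
	Name  string
	Quota *QuotaSpec
}

// checkQuota assumes p.Quota is always set, the same kind of unguarded access
// the postmortem describes.
func checkQuota(p *Policy, usage int64) bool {
	return usage < p.Quota.Limit // panics with a nil pointer dereference when Quota is nil
}

func main() {
	broken := &Policy{Name: "new-policy"} // Quota left unset, i.e. a "blank" field
	fmt.Println(checkQuota(broken, 10))   // runtime panic: invalid memory address or nil pointer dereference
}
```

A defensive version would check whether the field is present and reject or skip the malformed policy with an error rather than letting the whole process go down.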
Google confirmed that the new code was not safeguarded by a feature flag, the mechanism that lets risky behavior be enabled gradually and switched off quickly without redeploying a binary. Had such a flag been in place, engineers could have disabled the problematic path almost immediately and contained the damage; here, no such safety net existed. Moreover, the absence of proper error handling turned what should have been a recoverable fault into a crash that could not be resolved without manual intervention.
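A feature flag does not have to be elaborate; the point is that risky logic stays off by default and can be turned on, and back off, per region without shipping a new binary. The Go sketch below illustrates the idea, with a plain variable standing in for what would normally be a value read from a dynamic configuration service.

```go
package main

import "fmt"

// enableNewQuotaChecks stands in for a flag that would normally be read from a
// dynamic configuration service and controlled per region at runtime.
var enableNewQuotaChecks = false

func applyPolicy(policyName string) {
	if enableNewQuotaChecks {
		// The new, riskier logic runs only where the flag has been turned on.
		fmt.Println("applying additional quota checks for", policyName)
		return
	}
	// The old, well-tested behavior stays the default everywhere else.
	fmt.Println("applying legacy checks for", policyName)
}

func main() {
	applyPolicy("new-policy") // flag is off: legacy path
	enableNewQuotaChecks = true
	applyPolicy("new-policy") // flag deliberately enabled: new path, reversible at any time
}
```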
Google’s Site Reliability Engineering (SRE) team responded swiftly: the incident was detected within two minutes, the root cause identified within ten, and recovery efforts were under way within roughly forty minutes of the outage’s onset. However, the complexity only grew from there. In major regions like us-central1, restarting Service Control overwhelmed the infrastructure it depends on. A mechanism built for orderly, routine operation buckled under the avalanche of simultaneous retries: a “herd effect” (the classic thundering-herd problem) in which identical processes flood shared dependencies all at once.
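The textbook countermeasure to a herd effect is randomized exponential backoff: each restarting instance waits progressively longer between attempts and adds random jitter, so thousands of identical processes stop hammering their shared dependencies in lockstep. The Go sketch below shows the idea; the delay values are illustrative, not drawn from Google's systems.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoff returns how long to wait before retry attempt n: an exponentially
// growing base delay, capped, with full random jitter so simultaneous
// restarts spread themselves out instead of retrying in lockstep.
func backoff(attempt int) time.Duration {
	const base = 500 * time.Millisecond
	const maxDelay = 60 * time.Second

	d := base << attempt // 0.5s, 1s, 2s, 4s, ...
	if d <= 0 || d > maxDelay {
		d = maxDelay // guard against overflow and cap the delay
	}
	return time.Duration(rand.Int63n(int64(d))) // pick uniformly in [0, d)
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		wait := backoff(attempt)
		fmt.Printf("attempt %d: waiting %v before retrying the shared dependency\n", attempt, wait)
		time.Sleep(wait) // in a real client the retry call would follow here
	}
}
```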
Because Service Control is a distributed system, the overload in major regions propagated downstream, complicating the recovery effort. In some regions, restoration took nearly three hours. Simultaneously, Google products dependent on Service Control—Gmail, Drive, Meet, Calendar, Voice—began to falter one by one. Meanwhile, customers of Cloudflare’s Workers KV, which relies on Google Cloud, saw request failure rates of up to 90%.
Though most systems were back online by the evening of June 13, the aftermath continues to unfold. Google has pledged not only to prevent recurrence but to implement structural reforms in how infrastructure code is developed and deployed. Notably, the company aims to enhance both automated and manual communication with customers—ensuring that news of critical failures reaches them faster and with greater clarity.
One key takeaway: the notification and monitoring infrastructure must be decoupled from the rest of the cloud platform so that it remains operational even during catastrophic outages.