
Randy Bush via Outages writes:
re: goog outage so, which was it, bgp or dns? :)
Surprisingly, neither this time!

https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW

My interpretation, quite possibly wrong (no LLM abused, all hallucinations my own):

Step 0 (29 May): A critical component ("Service Control") was updated. It had a new feature ("additional quota policy checks") that came with a bug. The bug had slipped through tests because it is triggered only by (certain) "policy changes". For some reason the feature wasn't protected by a feature flag that would have allowed admins to quickly turn it off. Insufficient error handling allowed the binary to crash with a null pointer exception.

(Things worked nicely for two weeks and everybody forgot about the small change...)

Step 1 (12 June 1745 UTC): A new piece of policy was added. Spanner (Google's global database) dutifully replicated the update globally within seconds. The new policy contained some "unintended blank fields" that "exercised the code path that hit the null pointer causing the binaries to go into a crash loop".

Google's SRE team triaged this two minutes later, and found the root cause 10 minutes later. Between 25 and 40 minutes after the incident, a change to disable the buggy code path was rolled out.

Now the official explanation talks about some "red-button" code that, on one hand, they claim to have been in the 29 May code change. But it also had to be "put in place" after the detection of the incident. So it's not like, erm, a "red button" that you could just "push" in an emergency.

Anyway, ops hugs to everyone involved. Bugs happen, even to the best. This kind of policy code is likely to be in everybody's dependency graphs, including those of customers who build on Google's higher-level services as their own lower-level infrastructure, e.g. as underlying storage for their own key-value stores, as seems to have been the case for Cloudflare.

It's amazing that this stuff works so well almost all of the time.
And people will learn from this and make it work even better! I mean this without any irony. (Unless it turns out that the problematic code change was vibe coded.)

Cheers,
--
Simon.
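
For readers who want the failure mode in concrete terms, here is a minimal sketch of the pattern described above: a new check that crashes on a blank field, next to a variant that is guarded by a feature flag and handles the blank field explicitly. This is not Google's code. The names QuotaPolicy, FLAGS, extra_quota_checks, check_quota_unsafe and check_quota are all hypothetical, and the choice to let the request through when the field is blank ("fail open") is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    """Hypothetical policy record; limit=None models an 'unintended blank field'."""
    name: str
    limit: Optional[int]


# Hypothetical feature flag store. In a real system this would be something
# operators can flip at runtime without shipping a new binary.
FLAGS = {"extra_quota_checks": True}


def check_quota_unsafe(policy: QuotaPolicy, usage: int) -> bool:
    """New check with no flag and no error handling.

    Comparing an int with None raises TypeError, which stands in here for
    the null pointer crash described in the incident report.
    """
    return usage <= policy.limit


def check_quota(policy: QuotaPolicy, usage: int, flags: dict = FLAGS) -> bool:
    """Same check, guarded by a flag and tolerant of a blank field."""
    if not flags.get("extra_quota_checks", False):
        # Flag off: the new code path is skipped entirely.
        return True
    if policy.limit is None:
        # Blank field: allow the request instead of crashing (assumed
        # 'fail open' behaviour; a real system should also log this).
        return True
    return usage <= policy.limit
```

With a blank policy such as QuotaPolicy("new-policy", None), check_quota_unsafe raises on the first request, and a supervisor restarting the process would just hit the same data again, which is the crash loop. check_quota keeps serving, and setting the flag to False switches the new check off without a rollout.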