by singron 1211 days ago
Most of it is cellular or regional, but there are a few critical global services. The global network load balancing, network QoS, and DDoS prevention are more functional because they are global (i.e. you couldn't replace them with equivalent regional versions), but they are often the cause of issues like this. There was a push a few years ago to ensure global services had at least 99.999% uptime or else make them regional. This was a 48-minute outage, so it blows the five-nines budget for about 9 years.
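
For anyone who wants to check the "9 years" figure, here is a rough back-of-the-envelope sketch. It assumes a 365-day year and that the 99.999% target is measured over wall-clock minutes; the function names are made up for illustration and are not from any Google tooling.

```python
# Rough availability-budget arithmetic (assumes a 365-day year, no leap days).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR

def years_of_budget_consumed(outage_minutes: float, availability: float) -> float:
    """How many years of downtime budget a single outage uses up."""
    return outage_minutes / downtime_budget_minutes(availability)

budget = downtime_budget_minutes(0.99999)        # five nines
years = years_of_budget_consumed(48, 0.99999)    # the 48-minute outage
print(f"{budget:.2f} min/year, {years:.1f} years")
```

So five nines allows a little over 5 minutes of downtime per year, and a single 48-minute outage eats roughly nine years of that.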

Ex-Googler, no particular knowledge of this event; information might be out of date.


The pattern for past large Google outages has been:

1. Some networking-related service has global, non-standard (compared to the rest of the company) configuration

2. The relevant VP is aware and has decided not to change it because the change has been described as impossible

3. Some change elsewhere happens that assumes standard configuration

4. The networking service breaks and causes a global outage

5. VP is told to fix it

6. Fix rolls out in weeks, because it wasn't as hard as they said before

Often "impossible" is based on constraints like "0 downtime", "100% planned rollout", "rollback scenarios", etc.

These constraints get thrown to the wind when the downtime is already happening.

I was being a bit hyperbolic, but this is the real reason. However, the VPs in question often have the authority to approve changes that don't have rollback scenarios (for example); they just don't until the shit hits the fan.