Hacker News
by pclmulqdq 1206 days ago
The pattern for past large Google outages has been:

1. Some networking-related service has global, non-standard (compared to the rest of the company) configuration

2. The relevant VP is aware and has decided not to change it because that change is quoted as impossible

3. Some change elsewhere happens that assumes standard configuration

4. The networking service breaks and causes a global outage

5. VP is told to fix it

6. Fix rolls out in weeks, because it wasn't as hard as they said before

1 comment

Often "impossible" is based on constraints like "0 downtime", "100% planned rollout, rollback scenarios", etc.

These constraints get thrown to the wind when the downtime is already happening.

I was being a bit hyperbolic, but this is the real reason. However, the VPs in question often have the authority to approve changes that don't have rollback scenarios (for example); they just don't until the shit hits the fan.