by consumer451 1204 days ago
My knowledge level: can use AWS console to do < 5% of what is possible.

How much more work would Google create for themselves if they had not globalized their stack? Are we talking something like 5 subsets to manage instead of 1?

4 comments

Most of it is cellular or regional, but there are a few critical global services. The global network load balancing, network QoS, and DDoS prevention are more functional because they are global (i.e. you couldn't replace them with equivalent regional versions), but they are often the cause of issues like this. There was a push a few years ago to ensure global services had at least 99.999% uptime or make them regional. This was a 48-minute outage, so it blows that five-nines budget (about 5.26 minutes of downtime per year) for roughly 9 years.
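For anyone who wants to check the "9 years" figure, here is a rough sketch of the arithmetic. It assumes a 365.25-day year and that the budget is simply (1 - availability) times the minutes in a year; the function names are made up for illustration.

```python
# Minutes in an average year (assumption: 365.25 days).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def annual_downtime_budget_minutes(availability: float) -> float:
    """Downtime allowed per year at the given availability (e.g. 0.99999)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def years_of_budget_consumed(outage_minutes: float, availability: float) -> float:
    """How many years of downtime budget a single outage uses up."""
    return outage_minutes / annual_downtime_budget_minutes(availability)


budget = annual_downtime_budget_minutes(0.99999)
years = years_of_budget_consumed(48, 0.99999)
print(f"five-nines budget: {budget:.2f} min/year")
print(f"a 48-minute outage uses about {years:.1f} years of it")
```

So a single 48-minute outage eats a bit over nine years of a five-nines budget, which matches the figure in the comment.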

Ex-googler, no particular knowledge of this event, information might be out of date.

The pattern for past large Google outages has been:

1. Some networking-related service has global, non-standard (compared to the rest of the company) configuration

2. The relevant VP is aware and has decided not to change it because that change is quoted as impossible

3. Some change elsewhere happens that assumes standard configuration

4. The networking service breaks and causes a global outage

5. VP is told to fix it

6. Fix rolls out in weeks, because it wasn't as hard as they said before

Often "impossible" is based on constraints like "0 downtime", "100% planned rollout, rollback scenarios", etc.

These constraints get thrown to the wind when the downtime is already happening.

I was being a bit hyperbolic, but this is the real reason. However, the VPs in question often have the authority to approve changes that don't have rollback scenarios (for example); they just don't until the shit hits the fan.

Assuming good automation, most of the work comes from going from one of something to two. The difference in work between “single point” and “multiple point” is large, but increasing the number of points beyond that isn’t too bad.

Of course, if you deploy a change to all of your separated stacks at once through some sort of automated pipeline, the separation doesn’t help much. It's easy to break everything simultaneously that way if there’s some difference between test and prod you didn’t realize was there.
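To make the alternative concrete, here is a minimal sketch of rolling out to one stack at a time and stopping at the first unhealthy one, so a bad change never reaches every stack at once. This is not how Google or any particular pipeline does it; the stack names, `deploy`, and `is_healthy` are hypothetical stand-ins.

```python
from typing import Callable, List, Optional, Tuple


def staged_rollout(
    stacks: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Deploy to stacks in order; halt at the first stack that fails its check.

    Returns (stacks that succeeded, stack where the rollout halted or None).
    """
    succeeded: List[str] = []
    for stack in stacks:
        deploy(stack)
        if not is_healthy(stack):
            # Stop here: the remaining stacks never receive the change.
            return succeeded, stack
        succeeded.append(stack)
    return succeeded, None


# Hypothetical example: the change breaks "stack-b".
deployed: List[str] = []
succeeded, failed_at = staged_rollout(
    ["stack-a", "stack-b", "stack-c"],
    deploy=deployed.append,  # stand-in that only records the deploy
    is_healthy=lambda stack: stack != "stack-b",
)
print(succeeded, failed_at, deployed)
```

In this example the rollout halts at stack-b and stack-c is never touched. The catch from the comment still applies: this only helps if the health check actually notices the breakage.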

If you get into the nitty gritty of it, it doesn't really make sense. Are you going to have 5 different load balancer software stacks, with 5 different config file languages, causing each client (say Gmail) to have to implement their config 5 different ways? That's insane.

My biggest AWS surprise bill (so far!) was due to a bug in AWS console region switching.