by discodave 1000 days ago
From memory, the regionalization project ran from approx 2014 to 2015 or 2016.

There were also other reasons given, like the amount of internal software that used e.g. IPv4 addresses. Also, AWS likes to have 'lots of small things' instead of one big thing (regions, AZs, cells, two pizza teams, no (official) monorepo) so regionalization was part of that.

Another big reason for regionalization, other than IPv4 exhaustion, was that AWS promises customers that AWS regions are completely separate, but with one big giant network, it turns out there were all sorts of services making calls between regions that nobody had realized. I have a couple of funny examples, but that might make me too identifiable :)

6 comments

My favorite region isolation oversight was when someone realized that the Perl cron job that iterated over every border router globally and applied ACL updates 2-3x per day didn't pay attention to isolation at all, and could easily have just started blackholing the entire network one device at a time if someone configured a bad rule.

The mitigation was to sort routers by hostname which began with the regional airport codes (iad, pdx, etc.), and pause for 15 minutes each time the first three letters changed to give folks on-call time to react.
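For anyone who wants to picture that mitigation, here is a rough sketch in Python (the original was a Perl cron job, so this is only an illustration, not the actual script). The hostnames, `apply_update`, and the function names are all made up for the example; the only details taken from the comment are the sort by hostname, the three-letter airport prefix, and the 15-minute pause.

```python
import time
from typing import Callable, Iterable

# The 15-minute pause mentioned above, in seconds.
REGION_PAUSE_SECONDS = 15 * 60


def region_of(hostname: str) -> str:
    """Return the three-letter prefix, e.g. 'iad' from a hypothetical 'iad-br-1'."""
    return hostname[:3].lower()


def rollout(
    hostnames: Iterable[str],
    apply_update: Callable[[str], None],  # hypothetical per-router update hook
    sleep: Callable[[float], None] = time.sleep,
    pause: float = REGION_PAUSE_SECONDS,
) -> None:
    """Apply an update router by router, pausing whenever the region prefix changes."""
    previous_region = None
    # Sorting by hostname groups routers that share an airport-code prefix.
    for host in sorted(hostnames, key=str.lower):
        region = region_of(host)
        if previous_region is not None and region != previous_region:
            # Crossing into a new region: give on-call time to notice a bad rule.
            sleep(pause)
        apply_update(host)
        previous_region = region
```

Passing `sleep` in as a parameter is just a convenience so the pause can be stubbed out when trying the sketch; a real job would also want a way to abort during the pause, which this does not attempt.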

Oh wonderful. 15 minutes to get the page, put down my beer, get on my computer, sign in to everything, get 2-factored 3 times AND figure out exactly what’s happening and fix it.
Chop chop!
This really would not have been true for vendor network gear of the sort AWS had been buying for years by 2014. It's possible that their own switches or the weird fabric they have internally wouldn't have worked with v6, or there were Annapurna NIC ASIC issues, but their primary vendors all would have been fine.

I'm not saying there aren't v6 issues (for some vendors, resource exhaustion might have come into play) or bugs, but there's no way it's that massive a problem. There are huge and complex all v6 networks all over the planet that have more stringent requirements (by law) than AWS DCs.

Facebook started its transition to make everything* internally IPv6 slightly before then.

It was indeed a lot of work. But worth it.

* When I was there we still had a handful of weird things that couldn't be made IPv6. If you needed to access such things you could get a dual-stack dev server.

You're talking about snowfort, and while IP exhaustion was one reason, it's also an isolation/fault tolerance/security thing.
Indeed, blast radius is a real concern that a lot of folks who try to imitate AWS have to learn about the hard way.
Tell me more about these "pizza teams".
The idea is internal teams should be no bigger than what can be fed by 2 pizzas.
But I don't like working alone :(
slam dunk.
Badum tsshhhh
It’s unfortunate when you have big eaters in your team, but I suppose you can just scale up your pizza.

Pepperoni.16xlarge

oh

so they don't own 2 pizzerias? :(

ssh’ing through bastions was such a pain! We used the JMX GUI to review some AMP details from time to time, and port forwarding through the bastions was frowned upon, but our workflow was broken, what were we to do?

IIRC, early on in that project the gateways would get overwhelmed by the volume of traffic they were handling between various VPCs and had to be rolled back several times.

Of all the transitions I dealt with at Amazon, snowfort may have been my least favorite (though the ACL/role migration was pretty frustrating as well).