by aranchelk 1000 days ago
Can you remember what year it was?

I’ve got a slight suspicion you were given some bullshit or at least a creative treatment of facts e.g. everything had IPv6 support but FUD-filled network engineers didn’t want to turn it on.

Most network devices I’ve encountered were dual-stack way before anyone I knew seemed to care about actually using IPv6 — I always assumed it was added for US government/military requirements.


From memory, the regionalization project ran from approx 2014 to 2015 or 2016.

There were also other reasons given, like the amount of internal software that used e.g. IPv4 addresses. Also, AWS likes to have 'lots of small things' instead of one big thing (regions, AZs, cells, two pizza teams, no (official) monorepo) so regionalization was part of that.

Another big reason for regionalization, other than IPv4 exhaustion, was that AWS promises customers that AWS regions are completely separate, but with one big giant network, it turned out there were all sorts of services making calls between regions that nobody had realized. I have a couple of funny examples, but that might make me too identifiable :)

My favorite region isolation oversight was when someone realized that the Perl cron job that iterated over every border router globally and applied ACL updates 2-3x per day didn't pay attention to isolation at all, and could easily have started blackholing the entire network one device at a time if someone configured a bad rule.

The mitigation was to sort routers by hostname which began with the regional airport codes (iad, pdx, etc.), and pause for 15 minutes each time the first three letters changed to give folks on-call time to react.
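For anyone curious what that mitigation looks like in practice, here is a minimal Python sketch (the original was a Perl cron job, which I'm not reproducing). The hostnames, `apply_update` callback, and `rollout_by_region` helper are all hypothetical names for illustration; the only parts taken from the description above are "sort by hostname" and "pause 15 minutes whenever the first three letters change".

```python
import time
from itertools import groupby
from typing import Callable, Iterable, List


def region_of(hostname: str) -> str:
    """Return the three-letter prefix, assumed to be the airport code."""
    return hostname[:3].lower()


def rollout_by_region(
    hostnames: Iterable[str],
    apply_update: Callable[[str], None],
    pause_seconds: int = 15 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Apply an update to routers in hostname order, pausing between regions.

    Returns the hostnames in the order they were updated.
    """
    order: List[str] = []
    first_region = True
    # Sorting by hostname keeps each region's routers contiguous.
    for _region, group in groupby(sorted(hostnames, key=str.lower), key=region_of):
        if not first_region:
            # Prefix changed: give on-call time to notice a bad rule.
            sleep(pause_seconds)
        first_region = False
        for host in group:
            apply_update(host)  # hypothetical: push the ACL change to one router
            order.append(host)
    return order
```

Passing `sleep` as a parameter is just so the pause can be stubbed out when trying it; it still only limits how fast a bad rule spreads, it doesn't validate the rule.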

Oh wonderful. 15 minutes to get the page, put down my beer, get on my computer, sign in to everything, get 2-factored 3 times AND figure out exactly what’s happening and fix it.
Chop chop!
This really would not have been true for vendor network gear of the sort AWS had been buying for years by 2014. It's possible that their own switches or the weird fabric they have internally wouldn't have worked with v6, or there were Annapurna NIC ASIC issues, but their primary vendors all would have been fine.

I'm not saying there aren't v6 issues (for some vendors, resource exhaustion might have come into play) or bugs, but there's no way it's that massive a problem. There are huge and complex all-v6 networks all over the planet that have more stringent requirements (by law) than AWS DCs.

Facebook started its transition to make everything* internally IPv6 slightly before then.

It was indeed a lot of work. But worth it.

* When I was there we still had a handful of weird things that couldn't be made IPv6. If you needed to access such things you could get a dual-stack dev server.

You're talking about snowfort, and while IP exhaustion was one reason, it's also an isolation/fault tolerance/security thing.
Indeed, blast radius is a real concern that a lot of folks who try to imitate AWS have to learn about the hard way.
Tell me more about these "pizza teams".
The idea is internal teams should be no bigger than what can be fed by 2 pizzas.
But I don't like working alone :(
slam dunk.
Badum tsshhhh
It’s unfortunate when you have big eaters in your team, but I suppose you can just scale up your pizza.

Pepperoni.16xlarge

oh

so they don't own 2 pizzerias? :(

ssh’ing through bastions was such a pain! We used the JMX GUI to review some AMP details from time to time, and port forwarding through the bastions was frowned upon, but our workflow was broken, what were we to do?

IIRC, early on in that project the gateways would get overwhelmed by the volume of traffic they were handling between various VPCs and had to be rolled back several times.

Of all the transitions I dealt with at Amazon, snowfort may have been my least favorite (though the ACL/role migration was pretty frustrating as well).

Sure, everything supports IPv6 -- until you turn it on and rediscover the tickets that have been sitting at the bottom of the JIRA for the last decade.
As a matter of fact, Ron Broersma, who is affiliated with Space and Naval Warfare Systems Command (SPAWAR), has a list of equipment that should be fully IPv6-only compliant, including various management interfaces and more. The US Navy supposedly tests this in-house in an IPv6-only network. 4 years later, I imagine the situation has only gotten better: https://www.youtube.com/watch?v=9kQje5gSWw8

Also, I imagine AWS now has the majority of its NICs and switches built in-house. The underlay network could be IPv6 or totally custom for all we know (but it's probably IPv4).

Cool! I'm glad the military is pushing the internet forward, I guess some things never change :)

As for AWS, I tend to agree with the sibling post and your supposition about IPv4. Everything out of the Amazon organization is aggressively, err, "minimal."

It's their baby lol
I believe the issue wasn't IPv6 support in general, but TCAM space and the increase in routing table size when moving from v4 to v6. Overflowing TCAM would cause routing to hit the CPU, which would immediately lead to outages.

Tables were relatively large internally because AWS was all in on Clos networks at that point. And the devices used to build those Clos networks were running Broadcom ASICs, not Cisco or other likely vendors.
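To make the TCAM point concrete, here is a back-of-the-envelope sketch. Every number in it is made up for illustration: the slot multipliers and capacity are assumptions (how many slots a v6 route costs relative to a v4 route varies by ASIC and profile), and none of it reflects actual AWS or Broadcom figures.

```python
def tcam_slots_needed(
    v4_routes: int,
    v6_routes: int,
    v4_slots_per_route: int = 1,  # assumption: one slot per IPv4 route
    v6_slots_per_route: int = 4,  # assumption: wider keys cost more slots
) -> int:
    """Estimate total TCAM slots consumed by a routing table."""
    return v4_routes * v4_slots_per_route + v6_routes * v6_slots_per_route


def fits_in_tcam(v4_routes: int, v6_routes: int, tcam_slots: int, **slot_costs: int) -> bool:
    """True if the table fits in hardware; False means spilling to the CPU path."""
    return tcam_slots_needed(v4_routes, v6_routes, **slot_costs) <= tcam_slots


# Hypothetical device with 128k slots, already carrying 100k IPv4 routes.
print(fits_in_tcam(100_000, 0, 128_000))       # v4 only
print(fits_in_tcam(100_000, 20_000, 128_000))  # add 20k v6 routes alongside
```

With these invented numbers the v4-only table fits, and adding a comparatively small number of v6 routes in a dual-stack setup pushes it over, which is the failure mode described above.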

Right, if you worked at Amazon and didn't have an incentive, then you didn't do it. It was part of your job to not do things which you were not incentivized to do.
Just swap Amazon for any other company name and the sentence is still correct. People do what they are paid to do.
Right?? How old a device would you have to get to NOT have IPv6 support?

EDIT: But maybe bugs, IDK.

If Amazon is your customer, you fix the bugs; if you're Amazon using your in-house kit, you fix your own bugs whenever you want to. There are plenty of real reasons not to do IPv6, but they are virtually all politics and possibly operational ("we'd have to train our people, and we don't spend money on that"). The idea that it was a vendor issue is a BS trope that's been around for at least a decade, if not two.
> FUD-filled network engineers

FUD sounds like a mean way to say unproven in production