by Twirrim 1002 days ago
I remember the regionalisation, that was "fun" to be on the sidelines for (I was in a newer service that was regionalised from the get-go). I don't remember who the PM was for that one, but I remember that being when I truly came to respect the value that a TPM can add.

You're right about the cost and need to replace network equipment being one of the strong reasons why they didn't. Amazon used its own in-house designed and built network gear for a variety of reasons (IIRC there's a re:Invent talk about it), which is probably still the case. Every single one of those machines had fixed memory capacity and would need to be replaced to bump the memory up enough to handle IPv6 routing table needs etc. What they had wouldn't even be enough if they had chosen to go IPv6-only (which you couldn't get to except via dual-stack IPv4/IPv6 anyway).


Were they also by chance considered accelerators for encrypted traffic?

I'm not privy to details, but I recall a time when Amazon Infosec issued a mandate for a Java platform to remove an outdated encryption protocol. The change was made and rolled out with little fanfare.

A few weeks later, a large outage of Amazon Video (which used said platform) occurred on a Friday evening. Root cause? The network hardware accelerators were only set up to use that outdated protocol, which in turn meant that encryption was happening in software instead. Under load, the video hosting eventually caved.

Might be specific to the hardware used for Amazon retail, but it reinforces the point about their homegrown (and now aging) stack.

Maybe not the same story, but there was a sidecar service for encrypting traffic and doing access control and other things in a way that was transparent to the app (like Envoy, but without the mesh and much earlier). The original version was written by (maybe) a single engineer in Erlang. Version two was given to another team and rewritten in Java because. They had never tested at scale, and every team I know of that went to production with it fell over. There was some company-wide deadline, but it was unusable at that point, and the teams I was working with were gun-shy about trying it again, since it was obvious that the owning team had no idea what its performance characteristics or system requirements were.

I think I switched teams before that was resolved and moved to some greenfield work where we didn’t have to worry about scale for a while, but I do believe they eventually figured it out.