by ignoramous 2034 days ago
root-cause tldr:

...[adding] new capacity [to the front-end fleet] had caused all of the servers in the [front-end] fleet to exceed the maximum number of threads allowed by an operating system configuration [number of threads spawned is directly proportional to number of servers in the fleet]. As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.
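A rough sketch of why adding capacity could tip every server over the limit at once. It assumes each front-end server keeps one thread per peer in the fleet, which is only one way to read "directly proportional". The function names, the `base_threads` parameter, and all numbers are made up for illustration and are not from the postmortem.

```python
def threads_per_server(fleet_size: int, base_threads: int = 0) -> int:
    """Threads one server needs: a fixed baseline plus one per peer.

    Assumption: one thread per *other* server in the fleet.
    """
    return base_threads + (fleet_size - 1)


def exceeds_limit(fleet_size: int, max_threads: int, base_threads: int = 0) -> bool:
    """True if every server in the fleet would go over the OS thread limit."""
    return threads_per_server(fleet_size, base_threads) > max_threads


def max_safe_fleet_size(max_threads: int, base_threads: int = 0) -> int:
    """Largest fleet size that keeps each server at or under the limit."""
    return max_threads - base_threads + 1


if __name__ == "__main__":
    limit = 1000  # hypothetical OS thread limit, not the real value
    for size in (900, 1001, 1002):
        print(size, threads_per_server(size), exceeds_limit(size, limit))
```

Because the per-server count depends only on fleet size, the whole fleet crosses the threshold together, and a smaller fleet of larger servers lowers the count on every server.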

fixes:

...moving to larger CPU and memory servers [and thus fewer front-end servers]. Having fewer servers means that each server maintains fewer threads.

...making a number of changes to radically improve the cold-start time for the front-end fleet.

...moving the front-end server [shard-map] cache [that takes a long time to build, up to an hour sometimes?] to a dedicated fleet.

...move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet.

...accelerate the cellularization [0] of the front-end fleet to match what we’ve done with the back-end.

[0] https://www.youtube.com/watch?v=swQbA4zub20 and https://assets.amazon.science/c4/11/de2606884b63bf4d95190a3c...

1 comment

I wonder how many of these fixes were already logged engineering tasks that never got prioritized because of the aggressive push to add features.