| root-cause tldr: ...[adding] new capacity [to the front-end fleet] had caused all of the servers in the [front-end] fleet to exceed the maximum number of threads allowed by an operating system configuration [the number of threads each server spawns is directly proportional to the number of servers in the fleet]. As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters. fixes: ...moving to larger CPU and memory servers [and thus fewer front-end servers]. Having fewer servers means that each server maintains fewer threads. ...making a number of changes to radically improve the cold-start time for the front-end fleet. ...moving the front-end server [shard-map] cache [which takes a long time to build, sometimes up to an hour?] to a dedicated fleet. ...move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet. ...accelerate the cellularization [0] of the front-end fleet to match what we’ve done with the back-end. [0] https://www.youtube.com/watch?v=swQbA4zub20 and https://assets.amazon.science/c4/11/de2606884b63bf4d95190a3c... |
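
To make the scaling argument concrete, here is a rough sketch of why adding capacity can push every server over the limit at once, and why fewer, larger servers help. It assumes the simplest form of "proportional": one thread per peer server, plus some fixed baseline. The function names, the baseline, and all numbers are hypothetical illustrations, not values from the postmortem.

```python
def threads_per_server(fleet_size: int, baseline_threads: int = 0) -> int:
    """Threads one front-end server needs.

    Assumption (hypothetical): one thread per *other* server in the fleet,
    plus a fixed number of baseline threads for everything else.
    """
    if fleet_size < 1:
        raise ValueError("fleet_size must be at least 1")
    return (fleet_size - 1) + baseline_threads


def exceeds_limit(fleet_size: int, thread_limit: int, baseline_threads: int = 0) -> bool:
    """True if every server in a fleet of this size would exceed the OS thread limit."""
    return threads_per_server(fleet_size, baseline_threads) > thread_limit


def max_fleet_size(thread_limit: int, baseline_threads: int = 0) -> int:
    """Largest fleet size at which each server still fits under the thread limit."""
    return thread_limit - baseline_threads + 1


if __name__ == "__main__":
    limit, baseline = 1000, 100  # made-up numbers, for illustration only
    biggest = max_fleet_size(limit, baseline)
    for size in (biggest - 1, biggest, biggest + 1):
        print(size, threads_per_server(size, baseline), exceeds_limit(size, limit, baseline))
```

Under this model the per-server thread count is the same on every server, so one more server than `max_fleet_size` tips the whole fleet over together rather than just the new hosts. That matches the "all of the servers" wording above, and it is why shrinking the server count (larger instances) restores headroom.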