> ...all instances might OOM near the same time.

CloudFront, for this reason, allocates heterogeneous fleets in its PoPs, with different RAM sizes and CPUs [0], and even different software versions [1].

> When they all restart with cold caches, they might hammer the database again and cause the issue to repeat.

Reminds me of the DynamoDB outage of 2015 that essentially took out us-east-1 [2]. Also, ELB had a similar outage due to an unending backlog of work [3]. Someone must write a book on design patterns for distributed system outages or something.

[0] https://youtube.com/watch?v=pq6_Bd24Jsw&t=50m40s
[1] https://youtube.com/watch?v=n8qQGLJeUYA&t=39m0s
[2] https://aws.amazon.com/message/5467D2/
[3] https://aws.amazon.com/message/67457/

This is definitely a familiar problem if you rely on caches for throughput. (I think caches are most often introduced for latency, but eventually the service gets sized for its traffic with the cache in place, and ends up unintentionally needing the cache for throughput.) You can, e.g., pre-warm caches before accepting requests, or load-shed. Load-shedding is more general than pre-warming, so it's probably a good idea to deploy it throughout the service anyway. You can also load-shed on the client, so servers don't even have to accept, shed, then close a bunch of connections.
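
To make the server-side version concrete, here is a minimal sketch of load-shedding with a cap on in-flight requests. The names (`LoadShedder`, `handle`, `backend`) and the 503 response are my own assumptions for illustration, not any particular framework's API; real services often shed on queue depth or latency rather than a fixed counter.

```python
import threading


class LoadShedder:
    """Admit at most max_in_flight concurrent requests; reject the rest."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Returns True if the request was admitted, False if it should be shed.
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(shedder, request, backend):
    """Hypothetical request handler: shed cheaply instead of queueing."""
    if not shedder.try_acquire():
        # Rejecting early costs almost nothing, so a cold cache cannot
        # turn every request into a slow one.
        return 503, "overloaded, retry later"
    try:
        return 200, backend(request)
    finally:
        shedder.release()
```

The point of the design is that a rejected request does no backend work at all, so the requests that are admitted still see normal latency while the cache warms up.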
The more general pattern behind load-shedding is to make sure you handle a subset of the requests well instead of degrading all requests equally. E.g., processing incoming requests FIFO means that as the queue grows, every request gets slower. Using LIFO lets some requests stay just as fast, and the rest time out.
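
A toy simulation of the FIFO vs. LIFO point, as a sketch under made-up assumptions: a single worker, requests arriving twice as fast as they can be served, and a client timeout after which a request is dropped unserved. `simulate` and all its parameters are hypothetical, chosen only to illustrate the effect.

```python
import statistics
from collections import deque


def simulate(policy, n_requests=100, arrival_gap=1.0,
             service_time=2.0, timeout=5.0):
    """Return latencies of the requests that completed within the timeout.

    policy is "fifo" (serve oldest first) or "lifo" (serve newest first).
    Request i arrives at time i * arrival_gap; one worker serves them.
    """
    arrivals = [i * arrival_gap for i in range(n_requests)]
    queue = deque()
    latencies = []
    now = 0.0
    next_arrival = 0
    while next_arrival < n_requests or queue:
        # Enqueue everything that has arrived by the current time.
        while next_arrival < n_requests and arrivals[next_arrival] <= now:
            queue.append(arrivals[next_arrival])
            next_arrival += 1
        if not queue:
            # Idle worker: jump ahead to the next arrival.
            now = arrivals[next_arrival]
            continue
        arrived = queue.popleft() if policy == "fifo" else queue.pop()
        if now - arrived + service_time > timeout:
            # Cannot finish before the client gives up: drop without working.
            continue
        now += service_time
        latencies.append(now - arrived)
    return latencies


if __name__ == "__main__":
    for policy in ("fifo", "lifo"):
        served = simulate(policy)
        print(policy, len(served), statistics.median(served))
```

With these defaults both policies serve roughly half of the requests, but the median latency of served requests is 5.0 (right at the timeout) under FIFO and 2.0 (the bare service time) under LIFO. Same throughput, very different experience for the requests that succeed.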