| HN Mirror

Google's SRE book covers some of this (if you aren't cheekily referring to that). E.g. chapters 21 and 22 are "Handling Overload" and "Addressing Cascading Failures". The SRE book also covers mitigation by operators (e.g. manually setting traffic to 0 at load balancer and ramping back up, manually increasing capacity), but it also talks about engineering the service in the first place.

This is definitely a familiar problem if you rely on caches for throughput (I think caches are most often introduced for latency, but eventually the service is rescaled to traffic and unintentionally needs the cache for throughput). You can e.g. pre-warm caches before accepting requests or load-shed. Load-shedding is really good and more general than pre-warming, so it's probably a great idea to deploy throughout the service anyway. You can also load-shed on the client, so servers don't even have to accept, shed, then close a bunch of connections.

The more general pattern to load-shedding is to make sure you handle a subset of the requests well instead of degrading all requests equally. E.g. processing incoming requests FIFO means that as queue sizes grow, all requests become slower. Using LIFO will allow some requests to be just as fast and the rest will timeout.