Hacker News
by kevmo314 1301 days ago
> Maybe we can’t eliminate them, but that doesn’t mean there’s nothing we can do about them.

Specifically for the request case, I recall seeing a talk suggesting that processing requests last-in-first-out resolves this failure mode and also reduces the 99th percentile latency. The intuition is that if a request is already going to be late, you might as well set it aside and process it later, instead of doing a slow job and making everything else slow in the process.

Maybe that's a strange metaphor for work too.

3 comments

> as well as reduces the 99th percentile latency

How does it manage to do that?

When you're getting close to your limits and requests are actually waiting in the queue, I would expect FIFO to slow down everything to provide backpressure, while LIFO keeps a very nice median speed but has an increasing percentage of requests that time out and retry once or even twice.

Are there significant dynamics I'm not thinking of?

Maybe if you have bursts that are just big enough for FIFO to delay >1% of total requests, but small enough for LIFO to drop <1%? But in that situation a single percentile paints a misleading picture; the 99th percentile would do better on LIFO, but at the cost of trashing the higher percentiles.
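
To make the FIFO vs LIFO trade-off concrete, here is a toy simulation. It's only a sketch under made-up assumptions: one server that handles exactly one request per tick, no timeouts or retries, and a hypothetical workload of a 5-request burst followed by arrivals at exactly the service rate. The `simulate` function and `workload` are my own illustration, not something from the talk or the article.

```python
from collections import deque
from statistics import median


def simulate(arrivals, lifo):
    """Serve one request per tick; return each request's wait in ticks.

    arrivals[t] is the number of requests arriving at tick t
    (a hypothetical workload, chosen only for illustration).
    """
    queue = deque()  # holds arrival ticks, oldest on the left
    waits = []
    tick = 0
    while tick < len(arrivals) or queue:
        if tick < len(arrivals):
            queue.extend([tick] * arrivals[tick])
        if queue:
            # LIFO takes the newest request, FIFO takes the oldest.
            arrived = queue.pop() if lifo else queue.popleft()
            waits.append(tick - arrived)
        tick += 1
    return waits


# A burst of 5 at tick 0, then one arrival per tick for 9 more ticks.
workload = [5] + [1] * 9
fifo_waits = simulate(workload, lifo=False)
lifo_waits = simulate(workload, lifo=True)

print("FIFO median/max wait:", median(fifo_waits), max(fifo_waits))
print("LIFO median/max wait:", median(lifo_waits), max(lifo_waits))
```

In this toy setup the total waiting time is identical for both orderings; the only difference is who pays it. FIFO spreads the burst's delay over nearly every request, while LIFO keeps most requests at zero wait and pushes the whole delay onto the few left at the bottom of the stack, which in a real system would likely be the ones that time out.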

If you have a long-standing queue, switching to LIFO makes sense. It requires you to know how long the requests have been waiting, however, and many servers don't even have this basic information. Every request needs to come with a deadline and a timestamp, so the request processor can make rational decisions about processing or dropping it. If a service finds a request that arrived a long time ago and sat in the queue until its deadline was in the past, that's a fairly good signal to temporarily switch to LIFO processing.
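
A minimal sketch of that idea, assuming each request carries an arrival timestamp and an absolute deadline. The `Request` and `AdaptiveQueue` names are hypothetical, and the rule for switching back to FIFO (wait until the queue drains) is my own assumption, since the comment above only describes when to switch to LIFO.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrived_at: float  # timestamp set when the request was enqueued
    deadline: float    # absolute time after which the caller has given up


class AdaptiveQueue:
    """FIFO by default; newest-first after an expired request is found."""

    def __init__(self):
        self._items = deque()
        self.lifo_mode = False

    def push(self, request):
        self._items.append(request)

    def pop(self, now):
        """Return the next request worth processing, or None if none remain."""
        while self._items:
            if self.lifo_mode:
                request = self._items.pop()      # newest first
            else:
                request = self._items.popleft()  # oldest first
            if request.deadline <= now:
                # Expired while queued: drop it and treat it as an overload signal.
                self.lifo_mode = True
                continue
            return request
        # Queue drained, so the backlog is gone (assumed reset rule).
        self.lifo_mode = False
        return None


q = AdaptiveQueue()
q.push(Request("a", arrived_at=0.0, deadline=1.0))
q.push(Request("b", arrived_at=0.5, deadline=1.5))
q.push(Request("c", arrived_at=4.0, deadline=5.0))
q.push(Request("d", arrived_at=4.5, deadline=5.5))

served = q.pop(now=4.6)  # "a" is expired, so the queue flips to LIFO
print(served.name, q.lifo_mode)
```

Here the stale request `a` is dropped on the first `pop`, and the queue then serves the freshest request instead of working through the stale backlog in order.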

LIFO processing in the steady state is not a great idea because it will stochastically starve some requests for no reason.

In the steady state, the queue or stack should be nearly empty, no? It's only when shocks occur, as the article describes, that a lock convoy forms.
I know this isn't any kind of silver bullet, but if you have these kinds of time constraints, wouldn't building your solution on a real-time OS make more sense?