Hacker News new | ask | show | jobs
by aristus 4869 days ago
I do perf work at Facebook, and over time I've become more and more convinced that the most crucial metric is the width of the latency histogram. Narrowing your latency band --even if it makes the average case worse-- makes so many systems problems better (top of the list: load balancing) it's not even funny.
5 comments

I can chime in here that I have had similar experiences at another large scale place :). Some requests would take a second or more to complete with the vast majority finishing in under 100MS. A solution was put in place that added about 5 MS to the average request, but also crushed the long tail(it just doesn't even exist anymore) and everything is hugely more stable and responsive.
Let's assume that it is unacceptable to have each dyno tell the router each time it finishes a request ( http://news.ycombinator.com/item?id=5217157 ). And also that the goal is to reduce worst-case latency. And also that we don't know a priori how many requests each dyno should queue before it is 'full' and rejects further requests ( http://news.ycombinator.com/item?id=5216771 ).

Proposal:

1) Once per minute (or less often if you have a zillion dynos), each dyno tells the router the maximum number of requests it had queued at any time over the past minute.

2) Using that information, the router recalculates a threshold once a minute that defines how many queued requests is "too many" (e.g. maybe if you have n dynos, you take the log(n)th-busiest-dyno's load as the threshold -- you want the threshold to only catch the tail).

3) When each request is sent to a dyno, a header fields is added that tells the dyno the current 'too many' threshold.

4) If the receiving dyno has too many, it passes the request back to the router, telling the router that it's busy ( http://news.ycombinator.com/item?id=5217157 ). The 'busy' dyno remembers that the router thinks it is 'busy'. The next time its queue is empty, it tells the router "i'm not busy anymore" (and repeats this message once per minute until it receives another request, at which point it assumes the router 'heard').

5) When a receiving dyno tells the router that it is busy, the router remembers this and stops giving requests to that dyno until the dyno tells it that it is not busy anymore.

I haven't worked on stuff like this myself, do you think that would work?

How was 5 ms added? Multiple sleep states per request?

I imagine the long tail disappears in a similar way that a traffic jam is prevented by lowering the speed limit.

I think you misunderstood: they optimized the long running requests and the optimization incurred 5ms performance loss for short requests. It is not that the additional 5ms solved the problem.
I seem to recall Google mentioning on some blog several years ago that high variance in response latency degrades user experience much more than slightly higher average request times. I can't find the link though; if anyone has it, I'd be grateful.
Jeff Dean wrote a paper on it for CACM:

http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scal...

There's a relatively easy fix for Heroku. They should do random routing with a backup second request sent if the first request times fails to respond after a relatively short period of time (say, 95th percentile latency), killing any outstanding requests when the first response comes back in. The amount of bookkeeping required for this is a lot less than full-on intelligent routing, but it can reduce tail latency dramatically since it's very unlikely that the second request will hit the same overloaded server.

From experience, this is an incredibly effective way to DoS yourself. It was the default behaviour of nginx LB ages ago. Maybe only on EngineYard. Doesn't really matter as nobody uses nginx LB anymore.

Even ignoring the POST requests problem (yup, it tried to replay those) properly cancelling a request on all levels of a multi-level rails stack is very hard/not possible in practice. So you end up DOSing the hard to scale lower levels of the stack (e.g. database) at the expense of the easy to scale LB.

It's a capacity/latency tradeoff. Perhaps I'm biased by working at Google, where capacity is cheap and latency will kill you, but IIUC Heroku runs off of AWS and their database tier needs to scale horizontally anyway, so reserving sufficient overflow capacity should simply be a matter of throwing money at the problem.
Nginx introduced least_conn lb method in 1.3.1 which makes it a bit better. http://nginx.org/en/docs/http/ngx_http_upstream_module.html#...

ha-proxy is a lot better than nginx + more flexible if you want to introduce non-http to your stack.

Shouldn't the request be canceled on all levels if you cut the HTTP connection to the frontend?

Right now, heroku has one inbound load-balancer that's out of their control (probably ELB(s)). This load balancer hits another layer of mesh routers that heroku does control, and that perform all of herokus magic. In order for "intelligent routing," which is more commonly known as "least-conn" routing to work amongst the mesh layer, all of the mesh routers would have to share state with each other in real-time, which makes this a hard problem.

Alternately, heroku can introduce a third layer between the mesh routers and the inbound random load balancer. This layer consistently hashes (http://en.wikipedia.org/wiki/Consistent_hashing) the api-key/primary key of your app, and sends you to a single mesh router for all of your requests. Mesh routers are/should be blazing fast relative to rails dynos, so that this isn't really a bottleneck for your app. Since the one mesh router can maintain connection state for your app, heroku can implement a least-conn strategy. If the mesh router dies, another router can be automatically chosen.

A relatively easy fix, for read-only or idempotent requests. Also, if long-tail latency requests wind up being run twice, this technique might accelerate tip-over saturation. Still, this 'hedged request' idea is good to keep in mind, thanks for the pointer.

The 'tied request' idea from the Dean paper is neat, too, and Heroku could possibly implement that, and give dyno request-handlers the ability to check, "did I win the race to handle this, or can this request be dropped?"

> There's a relatively easy fix for Heroku. They should do random routing with a backup second request sent if the first request times fails to respond after a relatively short period of time (say, 95th percentile latency), killing any outstanding requests when the first response comes back in. The amount of bookkeeping required for this is a lot less than full-on intelligent routing, but it can reduce tail latency dramatically since it's very unlikely that the second request will hit the same overloaded server.

Your solution doesn't work if requests aren't idempotent.

Yeah, I know. I figure that for incoming HTTP traffic it's relatively easy to balance the GET requests, and if they're doing anything remotely sane with HTTP those ought to be idempotent (if they're not, Googlebot will come along and delete their site ;-)).

For mutating requests, there's a solution as well, but it involves checksumming the request and passing the checksum along so that the database layer knows to discard duplicate requests that it's already handled. You need this anyway if there's any sort of retry logic in your application, though.

Or use a parameter that's effectively a nonce.
This is something I've read in networked game literature: players react far better to consistent and high latency than to inconsistent and low latency, even if the averages are lower in the latter case. (It might even have been a John Carmack article).
That's probably true, but the value that Heroku is selling (and they charge a lot for it!) is that you _don't_ need to deal with this - that they will balance that out for you.
Correct, this matches my observations as well. I'd trade an increase in mean latency for a decrease in worst-case latency anytime. It makes it so much easier to reason about how many resources are needed for a given workload when your latency is bounded.
> Narrowing your latency band --even if it makes the average case worse-- makes so many systems problems better (top of the list: load balancing) it's not even funny.

Yeah, it's a lot more practical than implementing QoS, isn't it?