Hacker News new | ask | show | jobs
by joneholland 3355 days ago
You know you can implement a token bucket that doesn't share state between your API servers, in about 10 lines of code, using just a in memory map.

Your incoming requests should be balanced across all of the servers so you just derive the allowed throughput and divide by the number front ends....

3 comments

This can be appropriate as a tactical short-term hack, but for anyone reading who doesn't have a good sense for when to cut corners here, this isn't a great general-purpose solution. You're building assumptions about the way your machines receive requests into your service logic instead of externalizing it in your load balancer, which isn't a good practice.

In practical terms this means that whenever the characteristics of your environment changes, your rate limiting suddenly gets wonky. If you're doing a rolling deployment and take 1/3 of your machines out of service, or if some machines go down for maintenance, or a variety of other things happen to your machines, you're going to end up with fluctuating rate limits.

It sounds like OP is running a relatively mature service, so this probably isn't the best idea for them.

This only works if your bucket size is much larger than your number of servers.

In the degenerate case, imagine a rate limit of 2 total requests per minute load balanced across 2 servers with enough traffic that my request is basically hitting a random server. In this case, 50% of the time my second request will be rate limited incorrectly because each server has a bucket of 1 and my second request went to the same server as my first.

I'm sure someone smarter than me (and better at probability) could come up with an equation where you input your rate limit & number of servers and it tells you the probability of a false positive for a user.

That's valid, if the distribution of work is not fair, this won't work.

In practice, when you are receiving enough traffic to make throttling practical you aren't usually throttling at 2 RPM across 2 servers.

Even if you have more servers, you'll still very easily hit the case of rate limiting someone too early. And that's really bad, because it means your clients, who are aware of the rate limit and structure their code to stay under it, will start getting failures they shouldn't get, and they have no way to handle it besides intentionally staying significantly under the advertised rate limit.

So if you're really set on doing something like this, you need to set the actual rate limit to be significantly higher than the advertised rate limit, such that it's extremely unlikely for a client to be rate limited too fast.

Not everybody is Google or Facebook scale.
You have to send all the traffic from one client to one server then right? Seems not without heavy drawbacks.

Otherwise you can easily get them to hit their limit super early with bad luck.

No, you evenly round robin all traffic to all servers.

Each server contains a map of tokens per per client filling at a fixed interval.

That interval is calculated by taking the total global token refresh rate and dividing it by the number of servers.

The end result is exactly the same but, now you are stateless and have eliminated the bottleneck of a central token bucket.

Wait, each client does its own round robin (if you have three servers, I will hit 1 then 2 then 3)? Is that common?
The client doesn't do it. You put your front ends behind a load balancer like an ELB, or use a reverse proxy like Nginx.

Edit: And yes, round robin is the most commonly used load distribution technique, and works very well assuming each request has a roughly equivalent unit of work cost.

I'm surprised you wouldn't run into cases where the requests being rate-limited can't end up unevenly distributed between servers. There are assumptions you could add that would make that not a problem, but I'm surprised they'd hold.