Hacker News new | ask | show | jobs
by ptarjan 3363 days ago
Hi! I'm the author of the post and am happy to answer any questions you have.

There is also a corresponding set of code examples at the bottom that might be of interest to you: https://gist.github.com/ptarjan/e38f45f2dfe601419ca3af937fff...

6 comments

The code examples are written in ruby and lua - I take it that in your live environment you use the lua script compiled into nginx?

Does this mean every rate limited request is hitting redis?

I had this problem years ago, and I used varnish to offload the rate limited traffic, which scaled very well then. Here is a blog post I wrote about it: http://blog.dansingerman.com/post/4604532761/how-to-block-ra...

(it was written directly in response to a 'One of your users has a misbehaving script which is accidentally sending you a lot of requests' issue)

We actually use the LUA scripts on Redis to guarantee the atomicity that the token bucket algorithm needs. Then the Ruby middleware does the HTTP 429 response logic. The code samples I provided are pretty close to what we use in production.

So yes, our Redis traffic is strictly higher than our whole API requests per second. This is actually pretty easy since Redis scales pretty well horizontally (the keys shard well) in addition to being able to handle many orders of magnitude more traffic than a normal web stack per machine.

Thanks for linking to your blog post. I like the simplicity of not having to enter the web stack at all. I'm not very familiar with varnish, how do you end up doing the actual rate limiting algorithm? I see you making a hash but not dripping tokens into a bucket.

The rate limiting was done by the application server, but was completely separate from the varnish layer. (I think it was something fairly naive to do with memcached, but that's not terribly important). What was important was that the application server could serve the 400* with an expiry, and varnish would deal with that load until the expiry.

*This was 2011. The 429 response code wasn't defined until 2012.

That's an incredibly simple and elegant solution, and it's obvious how simple it is to scale it. Thanks for sharing!
A few questions. Have you looked at doing batch token assignment or "leases" from the centralized source to the workers? Ie request 10 tokens from the central "bucket" when the local "bucket" reaches zero. You can dramatically reduce the number of remote operations, reducing syncronization latency and cost. Scaling the batch size based on past request rate is a nice further optimization.

Similarly Im not sure about your "atomicity requirements." Is percieved accuracy driving the centralized store and single token semantics? Time skew alone is going to drive 0.1-1% innaccuracy in your rate limiters unless youre in crazy town doing 1pps or ptp everywhere.

Some food for thought is consolidating "rate" and "concurrent" limitations. Youll frequently see the same dichotomy in pps and bps based limits. If you can derive a consistent unit of work you can use a single bucket. Ie 10 requests that take 1 cpu second each have the same "cost" as 1 request at 10 cpu seconds. In the macro queue theory has them as the same impact to the system.

You "fleet usage" case, and a few other comments, reads like a simulacrum of weighted queues. Expanding it to weighted priority queues is a really powerful abstraction. Imagine requests which are over their base rate limit are not discarded, but actually marked as the lowest priority. That allows you to do generalized strategies like rejecting requests (load shedding) based on queue depth (latency) and FIFO/FILO.

PS: everyone reading _please_ respond with a 400 series if the request fails due to client request properties like rate. It really really really sucks to try and guess if 503 means "I" was rate limited and should back off, or "you" failed and the request should be immediately retried.

PPS A tactic to consider is returning the estimated recharge time when a request is throttled. A good client can use that to intelligently back of and achieve homeostasis with your server resources.

Too late to edit, but thought I should clarify. Nice introductory post and demo code for the concept. I wrote the parent in a hurry after work. I meant it as a mix of earnest questions and comments that came to me as I read through. Apologies if it came across as unnecessarily challenging.
Thanks for including real numbers, e.g. "Only 100 requests were rejected this month from this rate limiter". Always interesting to see how much a particular trapdoor actually had to be opened.
I find it also helpful to see the number is above zero and the code paths are actually exercised regularly, not just in rare failure scenarios.
This is an incredibly well written, helpful article. Thank you! I love that you included the Gist as well to show some real code, and that you mentioned how often each type of rate limiting is triggered. This is like the best example ever of how to write a tech blog post!
Thanks so much. I'm really glad you liked it.
I'm curious what your thoughts are on implementing this via a reverse proxy in front of your services. In terms of implementation it appears to me much simpler. But it obviously comes at the cost of maintaining the proxy if you are, say, already using ELBs.
Before I read this article (having just seen the title) I assumed this was going to be covering a reverse proxy system, and started pondering how I would implement something like this myself. A very basic approach probably would be simpler than middleware, but the question I got stuck on was whether or not a reverse proxy would need to be "smart" about the content it was limiting. That is, would it need to be programmed to parse requests and understand what the intention and purpose was of each request?

For example, limiting traffic by IP address via reverse proxy is simple, but it seems like it would be more difficult to limit by request priority.

I was surprised when the article revealed it was middleware, and suddenly that made a lot more sense and seemed easier because it no-longer requires duplicating application logic to understand the content of requests. Middleware definitely seems like the better approach to me after these considerations.

What kinds of methods would one use to solve the problem of needing to parse and understand incoming requests using the reverse proxy method?

From the bit I know, I think there are at least some options available. For example, in nginx I believe you can use the "map" module to mix and match different components of the request into your limiting.

From what I saw, HAProxy appears to be even more powerful. Its ACL concepts are completely able to create rules based on headers, IPs in the request, etc. and you can compose them into larger ACLs.

With the example of request priority, if you can determine the priority by it's URL or a header, let's say, you can achieve this with nginx. But if you need to look the user up in a DB and see how much they're paying you, you obviously have to do it in the application.

This can be done either way, and is really infrastructure dependent. There are pros and cons on both sides. We opted to bake it into the web stack instead of in front so that we get all the benefits of existing infrastructure like log aggregation, exception tracing, HTTP error response formatting and tracking, user-specific gating, auto service scaling, etc.
Without entirely ruling out the opinion, you can take any advice that such logic must take place in a reverse proxy in front of the app with a grain of salt. It's one thing to place certain anti-DOS/DDOS logic in front of the app, but basic rate limiting that depends on inspection of each request (including cache/database lookups for per-user limit increases) is certainly something that is very commonly found at the application middleware layer (after user authentication and route parsing, but before the controller).

Rate limiting is not meant to be a hard defence against denial of service attacks - that is an entirely separate engineering problem, where even a reverse proxy is not enough protection when you're facing a network-level flood. The primary purpose of rate limiting isn't to prevent hitting the application at all - it's to prevent hitting the heavy logic that starts when the controller is invoked. As long as your application's bootstrapping layer is lightweight, there is no problem with leaving rate limiting as part of the application.

imo, this sort of middleware (api management) should always be handled by a reverse proxy
Could anyone give me a quick overview of how that would work? Or what it would simplify? (Just curious!)
I have been working on this recently, using nginx as the reverse proxy. Nginx makes it pretty trivial to limit requests based on various factors, like request rate, number of connections, etc. A post like https://lincolnloop.com/blog/rate-limiting-nginx/ has snippets that show how it works.

But the simplification is that you do not need to write any application code. I can very easily have certain routes or domains limited differently by updating the nginx config. Especially in a microservices world I am trying to avoid having to update N services to get all of them to be rate limited.

Here is [1] the source code to a rate-limiting middleware written for the Caddy webserver if you're into reading Go code. Should be a good sample.

The whole thing is 27 unique files, 1068 lines of Go, according to cloc.

[1]: https://github.com/xuqingfeng/caddy-rate-limit/blob/master/c...

is token bucket the best algorithm for this? why not ring buffered timestamps?

ex: https://github.com/blendlabs/go-util/blob/master/collections...

Both should work. Token buckets only have to store two things (count and timestamp) where a timestamp set has to store all N timestamps. I like the simper approach since I don't actually need access to all the timestamps.

If you look at the concurrent request limiter I do indeed keep all the timestamps there in a Redis set. That one was more error prone to write in practice as I often would accidentally hit Redis storage limits.

~previous statement was incorrect~
In a token bucket algorithm you don't actually have a separate replenish step, it is baked into the next check you do. Similar to how in the example you linked the removal of the old entries is baked into the check step.

Check out my example: https://gist.github.com/ptarjan/e38f45f2dfe601419ca3af937fff... on line 23 with

    local filled_tokens = math.min(capacity, last_tokens+(delta*rate))
that adds in any tokens that should drip into the bucket since the last check.
sorry was thinking of something else.

yeah the algo is pretty standard, what we found is some edge cases get super weird, namely if you check on regular intervals you'll get false positives, vs. bursty calls

https://play.golang.org/p/Ujp7yeFQ3L

There is no replenishing process, its just a conceptual thing. In code one just computes time difference and then multiplies that with fill rate to determine tokens to fill before deducting any. Token buckets are popular because they work well, require fixed size storage and are extremely simple to implement and test.