Hacker News new | ask | show | jobs
by hxtk 373 days ago
Something I’ve long wondered is why you never hear about rate limiting algorithms that are based on the cost to serve the request or algorithms that dynamically learn the capacity of the system and give everyone a fair share.

In the field of router buffer management, there are algorithms like Stochastic Fair Blue, which does the latter, but is somewhat hard to apply to HTTP because you’d have to define a success/failure metric for each request (a latency target, for example), and clients would have to tolerate a small probability a request being rejected even when they’re not being rate limited.

In Google’s paper on the Zanzibar authorization system, they give brief mention to rate limiting clients based on a CPU time allocation, but don’t go into any detail since it’s not a paper on rate limiting.

It’s something that matters less today with ubiquitous autoscaling, where the capacity of the system is whatever you need it to be to give each user what they ask for up to their rate limit, but I’m surprised at my inability to find any detailed account of such a thing being attempted.

6 comments

One of those times when I was still learning that asking forgiveness is easier than asking permission, I wanted to eliminate a very expensive presence calculation that I and a coworker determined were accounting for almost 10% of average page load time. Some idiot in product has decided they wanted an OLTP-ish solution that told you -exactly- how many people were online and like a fool I asked if we could do a sane version and they said no. If you don't ask, then it's not insubordination.

For situations where eventual consistency is good enough, you can run a task in a loop that tries every n seconds to update a quantity. But as you say that can also saturate, so what you really want is for the task to update, then wait m seconds and go again, where m is more than the time you expect the task to complete in (<<50% duty cycle). As the cost of the operation climbs the time lag increases but the load on the system increases more slowly. If you want to, collect telemetry on how often it completes and set up alarms for if it doesn't for a duration that is several times longer than your spikiest loads happen.

I don't think voluntary rate limiting on the client side gets enough column inches. Peer to peer you end up footguning yourself if you bite off more than you can chew, and if you start stalling on responses then you gum up the server as well.

Shuffle-sharding is similar to stochastic Blue stuff, and you'll find Amazon talking about it:

https://aws.amazon.com/builders-library/workload-isolation-u...

Which isn't exactly what you're talking about, but between that and other things in the "Builder's Library" series, you can see that people are doing this, and writing about it.

Envoy has a latency-based adaptive concurrency feature: https://www.envoyproxy.io/docs/envoy/latest/configuration/ht...

Netflix has a blog post for their implementation: https://netflixtechblog.medium.com/performance-under-load-3e...

Yes, autoscale is a thing but it's rarely instantaneous ; you'll still benefit from having a good handle on load fairness.

Furthermore, modern GPU workloads are way less elastic in capacity scaling

I mentioned this in another place on this thread, but a simple AIMD algorithm paired with a token bucket is surprisingly effective at dynamically adjusting to available capacity, even across a fleet of services not sharing state other than the contended resource.

Pretty easy to pair AIMD with token bucket (eg https://github.com/webriots/rate)

I notice that library self-identifies a problem of generally keeping a fixed-size number of buckets and addressing them with hashing, which leads to great memory usage but introduces a risk of collision.

Stochastic Fair Blue shares that problem, so I think you might find the solution it uses interesting: it rotates hashes on fixed intervals by re-seeding to ensure that if a responsive flow collides with a non-responsive flow that's being rate-limited, it is at least guaranteed not to do it for very long.

This leads to a problem similar to the problem with Fixed Window Counter rate limiting where it would basically forget the rate limit history every interval. To solve this, two queues are fed input data at a time: the router enforces one of them, while just feeding the other data and ignoring its output in order to warm up its buckets.

I imagine if I tried to make use of the library you linked and didn't absolutely need per-user granularity, I'd do something similar by concatenating the user's identifier with a seed value for each of the two queues and rotating at regular intervals.

Love the simplicity and elegance of the design to address this problem.
My assumption would be that it is a complexity thing. As a consumer of the service having a rate limit that is easy to understand and write retry logic for is a big plus. If the criteria is "x requests per 5 minute window" and I start getting rate limit errors it's very clear what back off behaviour I need to implement. If the criteria is CPU usage of my requests, as a consumer it's hard for me to reason about how much CPU a given request is going to take so my retry logic is going to be fairly dumb.