Thinking this could be useful in a multi tenants service where you need to fairly allocate job processing capacity across tenants to a number of background workers (like data export api requests, encoding requests etc.)
That was my first thought as well. However, in a lot of real world cases, what matters is not the frequency of requests, but the duration of the jobs. For instance, one client might request a job that takes minutes or hours to complete, while another may only have requests that take a couple of seconds to complete. I don't think this library handles such cases.
Lots of heuristics continue to work pretty well as long as the least and greatest are within an order of magnitude of each other. It’s one of the reasons why we break stories down to 1-10 business days. Anything bigger and the statistical characteristics begin to break down.
That said, it’s quite easy for a big job to exceed 50x the cost of the smallest job.
defining a unit of processing like duration or quantity and then feeding the algorithm with the equivalent of units consumed (pre or post processing a request) might help.
To mitigate this case you could limit capacity in terms of concurrency instead of request rate. Basically it would be like a fairly-acquired semaphore.
I believe nginx+ has a feature that does max-conns by IP address. It’s a similar solution to what you describe. Of course that falls down wrt fairness when fanout causes the cost of a request to not be proportional to the response time.