by miggy 294 days ago
We had a critical service that often got overwhelmed, not by one client app but by different apps over time. One week it was app A, the next week app B, each with its own buggy code suddenly spamming the service.

The quick fix suggested was caching, since a lot of requests were for the same query. But after debating, we went with rate limiting instead. Our reasoning: caching would just hide the bad behavior and keep the broken clients alive, only for them to cause failures in other downstream systems later. By rate limiting, we stopped abusive patterns across all apps and forced bugs to surface. In fact, we discovered multiple issues in different apps this way.
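For illustration, per-client rate limiting along these lines could be sketched as a token bucket keyed by app. The post does not say which algorithm or limits were used, so the token-bucket choice, the class names, and the `rate`/`capacity` parameters below are all assumptions.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill for the elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerClientLimiter:
    """Hypothetical wrapper: one bucket per client app, so one noisy app
    cannot use up the budget of the others."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._buckets = defaultdict(lambda: TokenBucket(rate, capacity, clock))

    def check(self, client_id):
        """Return an HTTP-style status: 200 if allowed, 429 if throttled."""
        return 200 if self._buckets[client_id].allow() else 429
```

With one bucket per app, a buggy app starts seeing 429s (which is what surfaces the bug) while well-behaved apps keep their own budget.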

Takeaway: caching is good, but it is not a replacement for fixing buggy code or misuse. Sometimes the better fix is to protect the service and let the bugs show up where they belong.

3 comments

It's funny how I encountered a problem that went exactly the opposite way! We initially introduced a rate limiter that was adequate at the time, but as the product scaled up it stopped being adequate, and any 429 failures were either ignored or closed as client bugs. Only after some time did we realize that the request rate had scaled up roughly in line with product growth. The quick fix was to simply remove the limiter, but after a couple of incidents where the DB decided to take a nap after being overwhelmed, we added a caching layer.

Just goes to show that there is no silver bullet: context, experience, and a good amount of gut feeling are paramount.

Something that was drilled into me early in my career was that you cannot expect your cache to be up 100% of the time. The logical extension of that is your main DB needs to be able to handle 100% of your traffic at a moment’s notice. Not only has this kind of thinking saved my ass on several occasions, but it’s also actually kept my code much cleaner. I don’t want to say rate limiters and circuit breakers are the mark of bad engineering, butttt they’re usually just good engineering deferred.
Reminds me of gas plumbing: the indoor lines are only a few psi above ambient, but the lines themselves have to withstand line pressure of up to 300 psi in case the regulator fails. It's good advice!
I guess CPUs are pretty buggy with all their caches. If only the hardware people could fix their buggy systems.

In all seriousness sometimes a cache is what you need. Inline caching is a classic example.

There are times when a cache is appropriate, but I often find that it's more appropriate for the cache to be on the side of whoever is making all the requests. This isn't applicable when that is e.g. millions of different clients all making their own requests, but rather when we're talking about one internal service putting heavy load on another one.

The team with the demanding service can add a cache that's appropriate for their needs, and will be motivated to do so in order to avoid hitting the rate limit (or reduce costs, which should be attributed to them).
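A rough sketch of what such a caller-side cache might look like, assuming a simple time-to-live policy is acceptable for the calling team's data. `TTLCache`, `fetch`, and `ttl_seconds` are hypothetical names, and `fetch` stands in for whatever call actually hits the rate-limited service.

```python
import time


class TTLCache:
    """Caller-side cache: repeat queries within `ttl_seconds` never reach
    the upstream service, so they do not count against the rate limit."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self.fetch = fetch          # hypothetical function that calls the service
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}          # query -> (expires_at, value)

    def get(self, query):
        now = self.clock()
        entry = self._entries.get(query)
        if entry is not None and entry[0] > now:
            return entry[1]         # fresh hit, no upstream request
        value = self.fetch(query)   # miss or expired: one upstream request
        self._entries[query] = (now + self.ttl, value)
        return value
```

The point of keeping it on the caller's side is that the TTL (and how much staleness is tolerable) is decided by the team that knows its own access pattern.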

You cannot trust your clients. Period. It doesn’t matter if they’re internal or external. If you design (and test!) with this assumption in mind, you’ll never have a bad day. I’ve really never understood why teams and companies have taken this defensive stance that their service is being “abused” despite having nothing even resembling an SLA. It seemed pretty inexcusable to not have a horizontally scaling service back in 2010 when I first started interning at tech companies, and I’m really confused why this is still an issue today.
I fully agree. The rate limits are how you control the behaviour of the clients. My suggestion is to leave caching to the clients, which they may want to do in order to avoid hitting the rate limit.
>why teams and companies have taken this defensive stance that their service is being “abused” despite having nothing even resembling an SLA.

I mean, because bad code on a fast client system can cause a load higher than all other users put together. This is why half the internet is behind something like Cloudflare these days. Limiting, blocking, and banning have to be baked in.

You can never trust clients to behave. If your goal is to reduce infra cost, sure, rate limiting is an acceptable answer. But is it really that hard to throw on a cache and provision your service to be horizontally scalable?
Scaling matters, but why pay for abusive clients or bots? Adding a cache is easy; the hard part is invalidation, sync, and thundering herd. Use it if the product needs it, not as a band-aid.
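To make the thundering herd part concrete: when a hot key expires, every concurrent request misses at once and they all hit the backend together. One common mitigation (not something the thread prescribes) is to coalesce concurrent loads for the same key so only one goes through. A minimal sketch, where `SingleFlight`, `_Call`, and `load` are hypothetical names:

```python
import threading


class _Call:
    """State shared between the one loader and the callers waiting on it."""

    def __init__(self):
        self.done = threading.Event()
        self.value = None
        self.error = None
        self.waiters = 0  # callers that joined this load, for observability


class SingleFlight:
    """Coalesce concurrent loads of the same key into one backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> _Call

    def do(self, key, load):
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[key] = call
            else:
                call.waiters += 1
        if leader:
            try:
                call.value = load()      # the only backend request
            except Exception as exc:
                call.error = exc         # waiters see the same failure
            finally:
                with self._lock:
                    del self._in_flight[key]
                call.done.set()          # wake everyone who joined
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.value
```

This only handles the herd; invalidation and sync are still separate problems, which is rather the point of the comment above.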