by kache_ 2052 days ago

When deciding which mechanism to use for load shedding, keep in mind the layer at which you are shedding. Modern distributed systems are composed of many layers: you can shed at the load balancer, at the OS level, or in the application logic. This is a trade-off. The closer you get to the core application logic, the more information you have to make a decision. On the other hand, the closer you get, the more work you have already performed and the more it costs to throw the request away.

You may employ techniques more complex than a simple bucketing mechanism, such as closely observing the degree to which clients are exceeding their baseline. However, these techniques aren't free. Even the cost of simply throwing away requests can overwhelm your server, and the more steps you add before the shedding step, the lower the maximum throughput you can tolerate before availability drops to zero. It's important to understand at what point this happens when designing a system that relies on this technique.
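
To make "at what point this happens" a bit more concrete, here is a back-of-the-envelope model in Python. It is only a sketch under assumptions of my own: a server with a fixed CPU budget per second, a fixed cost to serve a request, and a smaller fixed cost to reject one. The names `capacity`, `serve_cost`, and `shed_cost` are illustrative, not measurements from a real system.

```python
def goodput(arrival_rate: float, capacity: float,
            serve_cost: float, shed_cost: float) -> float:
    """Successful requests/sec under a simple linear cost model (assumed).

    capacity   -- CPU-seconds available per wall-clock second
    serve_cost -- CPU-seconds to fully serve one request
    shed_cost  -- CPU-seconds spent on a request before it is rejected
    """
    if not 0 <= shed_cost < serve_cost:
        raise ValueError("rejecting must be strictly cheaper than serving")
    # Under capacity: every request can be served.
    if arrival_rate * serve_cost <= capacity:
        return arrival_rate
    # Over capacity: serve `served` requests and shed the rest, so that
    # served * serve_cost + (arrival_rate - served) * shed_cost == capacity.
    served = (capacity - arrival_rate * shed_cost) / (serve_cost - shed_cost)
    # Once rejections alone consume the whole budget, nothing gets served.
    return max(served, 0.0)
```

In this model goodput reaches zero once `arrival_rate` hits `capacity / shed_cost`. With one CPU-second per second and a 1 ms reject path, that is about 1000 requests/sec; if extra steps before the decision push the reject path to 5 ms, it drops to about 200.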

For example, shedding at the OS level is a lot cheaper than leaving it to the server process. If you choose to do it in your application logic, think carefully about how much work is done for the request before it gets thrown away. Are you validating a token before you make your decision?
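
As a rough illustration of the ordering question in application logic, the sketch below makes the shed decision before any token validation. Everything here is hypothetical: `make_handler`, the dict-shaped request, and the `validate_token` and `do_work` callbacks are stand-ins, and an in-flight limit is just one possible shedding signal.

```python
import threading
from typing import Callable, Tuple


def make_handler(max_in_flight: int,
                 validate_token: Callable[[str], bool],
                 do_work: Callable[[dict], str]) -> Callable[[dict], Tuple[int, str]]:
    """Build a request handler that sheds load before doing expensive work."""
    # One slot per request allowed in flight at the same time.
    slots = threading.Semaphore(max_in_flight)

    def handle(request: dict) -> Tuple[int, str]:
        # Cheapest check first: a non-blocking acquire, with no parsing,
        # crypto, or I/O done on behalf of a request we may throw away.
        if not slots.acquire(blocking=False):
            return 503, "overloaded"
        try:
            # Only admitted requests pay for token validation.
            if not validate_token(request.get("token", "")):
                return 401, "unauthorized"
            return 200, do_work(request)
        finally:
            slots.release()

    return handle
```

The trade-off from above still applies: deciding this early means the decision cannot depend on who the caller is, because the token has not been looked at yet.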

1 comments

jeffbee 2052 days ago

You touch on the key thing that people sometimes overlook. Whatever you are doing to serve errors has to be strictly less expensive than serving successes. If your load shedding error path does things like logging synchronously to a file (as you might get from a logging library that synchronizes outputs for warnings and errors, but not information), taking a lock to update a global error counter, or formatting stack traces in exceptions, it's possible that load shedding will _cause_ the collapse of your service instead of preventing it.
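
A minimal sketch of what a cheap error path might look like, in Python: the response is built once, no exception or stack trace is created, and only every Nth rejection is logged. The class name, the `log_every` parameter, and the logger name are made up for illustration. The lock-free counter relies on `next()` on `itertools.count` being effectively atomic under CPython's GIL, which is an implementation detail rather than a guarantee.

```python
import itertools
import logging

log = logging.getLogger("shed")

# Built once at import time, so rejecting a request does no formatting.
REJECT_RESPONSE = (503, b"overloaded\n")


class CheapRejecter:
    """Reject path that avoids per-request logging and explicit locks."""

    def __init__(self, log_every: int = 1000) -> None:
        self._log_every = log_every
        # Assumption: next() on itertools.count is effectively atomic under
        # CPython's GIL, so no explicit lock is taken to count rejections.
        self._counter = itertools.count(1)
        self.rejected = 0  # approximate when read from other threads

    def reject(self):
        n = next(self._counter)
        self.rejected = n
        # Sampled logging: one line per `log_every` rejections, so a
        # synchronous log handler is hit rarely instead of on every error.
        if n % self._log_every == 0:
            log.warning("shed %d requests so far", n)
        # No exception is raised, so no stack trace is ever formatted.
        return REJECT_RESPONSE
```

Whether this is actually cheaper than the success path still has to be measured for the service in question.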

joatmon-snoo 2052 days ago

+1. Additionally, if you end up in a scenario where you don't even have enough capacity in a given layer to fail quickly, your only options are to either increase capacity or throttle load before it reaches the server (either in the network or in the clients).
