Hacker News new | ask | show | jobs
by electricshampo1 1913 days ago
For this load shedding in a single node, are you imagining something more than work stealing style approaches?

In a multi-node cooperative setting you need some way to transmit information that a given node is overloaded, some way to find nodes that have available capacity, and a low overhead way to shift the work over to them. If the work to be done depends solely on data that you have local to you, it seems silly to shift the data as well (depending on how big it is); this would only make sense when you have to combine data from a variety of nodes (which can be done on any other node, assuming that the current node is overloaded).

Probably not worth doing anything about short lived hotspots (<1s). I wonder what kind of granularity you have used in your systems (probably different for within node vs across nodes).

1 comments

Yes, something different than work-stealing on a single node. Instead of pulling work from other cores when idle, cores push the data that is attracting work to other cores. Even though shedding data that is attracting work is not the same as shedding work per se, the effect is equivalent in real systems if you can shed it with very low latency.

The last several setups I've seen worked this way. This has lower overhead and significantly less thread contention than work-stealing. The hotspot granularity was pretty coarse a decade ago -- on the order of a few seconds -- because the algorithms and techniques weren't as good and the overhead was relatively high. Modern versions have some tricks that allow them to shed load so quickly (millisecond range) and with so little overhead that it just looks really smooth to have it proactively running all the time. The system I am currently working with is designed to shed thousands of megabyte-scale data chunks per second without interfering with throughput but you typically need quite a bit less to keep things balanced.

The ideas were originally developed for large parallel clusters running workloads inherently prone to hotspotting. To your point is a more complex problem and operates at coarser time granularities, both out of necessity and because hotspots form more slowly anyway. It turns out it works even better when applied at the scale of a large server. The concept is pretty simple in the abstract, the challenge is arriving at an efficient implementation with few edge cases that also can provide strong ordering guarantees. Quite a few slightly broken but usable designs were put into production, including by myself, before people started to grok the design idioms for the single-node case.

An important point is that unlike work-stealing, you can't sprinkle this on top of a system that wasn't designed for it. Implementations lean heavily on schedule control mechanisms, which most software architectures don't provide.