| HN Mirror

Yes, something different than work-stealing on a single node. Instead of pulling work from other cores when idle, cores push the data that is attracting work to other cores. Even though shedding data that is attracting work is not the same as shedding work per se, the effect is equivalent in real systems if you can shed it with very low latency.

The last several setups I've seen worked this way. This has lower overhead and significantly less thread contention than work-stealing. The hotspot granularity was pretty coarse a decade ago -- on the order of a few seconds -- because the algorithms and techniques weren't as good and the overhead was relatively high. Modern versions have some tricks that allow them to shed load so quickly (millisecond range) and with so little overhead that it just looks really smooth to have it proactively running all the time. The system I am currently working with is designed to shed thousands of megabyte-scale data chunks per second without interfering with throughput but you typically need quite a bit less to keep things balanced.

The ideas were originally developed for large parallel clusters running workloads inherently prone to hotspotting. To your point is a more complex problem and operates at coarser time granularities, both out of necessity and because hotspots form more slowly anyway. It turns out it works even better when applied at the scale of a large server. The concept is pretty simple in the abstract, the challenge is arriving at an efficient implementation with few edge cases that also can provide strong ordering guarantees. Quite a few slightly broken but usable designs were put into production, including by myself, before people started to grok the design idioms for the single-node case.

An important point is that unlike work-stealing, you can't sprinkle this on top of a system that wasn't designed for it. Implementations lean heavily on schedule control mechanisms, which most software architectures don't provide.