by jandrewrogers
1911 days ago
Data sharing between cores is the bane of parallel programming at every scale. Consequently, high-performance parallel systems are not multithreaded in any meaningful sense -- each thread operates almost exclusively on private local data. See also "thread-per-core" software architectures for maximizing absolute throughput, which have their origins in supercomputing. However, this creates a new problem for any workload that is not embarrassingly parallel. If all data is private to a single thread/core, then the workload can become very unevenly distributed across cores depending on the data they own, destroying efficient parallelism in another way. Unlike multithreading, quick adaptive load shedding to smooth out these hotspots can scale surprisingly well across a very large number of cores/nodes with little overhead for many workloads. This is how many massively parallel codes are written today for workloads that are not embarrassingly parallel. Partitioning data across cores without sharing is necessary to maximize throughput, and almost always better than multithreading, but it is insufficient on its own. Fortunately, mitigating transient hotspots is a mostly solved problem.
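To make the imbalance problem concrete, here is a minimal sketch, not taken from any real system. It assumes keys are assigned to cores by a stable hash (`zlib.crc32` is just a convenient stand-in), and the names `owner`, `load_per_core`, and `imbalance` are made up for illustration. It only counts how many requests each core would own and reports how far the busiest core sits above the mean.

```python
import zlib
from collections import Counter

def owner(key: str, num_cores: int) -> int:
    """Map a key to the single core that owns it (assumed hash partitioning)."""
    return zlib.crc32(key.encode("utf-8")) % num_cores

def load_per_core(keys, num_cores: int) -> list:
    """Count how many requests land on each core when every key has one owner."""
    counts = Counter(owner(k, num_cores) for k in keys)
    return [counts.get(core, 0) for core in range(num_cores)]

def imbalance(loads) -> float:
    """Ratio of the busiest core's load to the mean load (1.0 is perfectly even)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0
```

With a request stream dominated by one hot key, for example `["hot"] * 1000 + ["a", "b", "c", "d"]` on 4 cores, whichever core owns `"hot"` takes at least 1000 of the 1004 requests, so `imbalance` comes out close to 4 no matter how the other keys hash. That is the kind of skew load shedding is meant to smooth out.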
In a multi-node cooperative setting you need some way to signal that a given node is overloaded, some way to find nodes that have available capacity, and a low-overhead way to shift the work over to them. If the work to be done depends solely on data that is local to the overloaded node, it seems silly to shift the data as well (depending on how big it is); moving data would only make sense when the work has to combine data from a variety of nodes anyway, in which case it can be done on any node other than the overloaded one.
It's probably not worth doing anything about short-lived hotspots (<1s). I wonder what kind of granularity you have used in your systems (probably different within a node vs. across nodes).
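The three pieces above (an overload signal, a search for spare capacity, a cheap hand-off) can be sketched roughly as follows. This is a toy single-process simulation under assumptions of my own, not how any particular system does it: queue depth stands in for load, the `high_water`, `low_water`, and `min_hot_seconds` thresholds are hypothetical, and only work items move, never the data they refer to. Hotspots shorter than `min_hot_seconds` are deliberately ignored.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    name: str
    queue: list = field(default_factory=list)   # pending work items (not data)
    overloaded_since: Optional[float] = None     # when the overload was first seen

def rebalance(nodes: List[Node], now: float, high_water: int = 8,
              low_water: int = 4, min_hot_seconds: float = 1.0
              ) -> List[Tuple[str, str, int]]:
    """Run one shedding pass and return (source, destination, items_moved) tuples."""
    moves = []
    for src in nodes:
        # Overload signal: queue depth above the high-water mark.
        if len(src.queue) <= high_water:
            src.overloaded_since = None
            continue
        if src.overloaded_since is None:
            src.overloaded_since = now
        # Skip hotspots that have not lasted long enough to be worth acting on.
        if now - src.overloaded_since < min_hot_seconds:
            continue
        # Capacity search: pick the node with the shortest queue.
        dst = min(nodes, key=lambda n: len(n.queue))
        if dst is src or len(dst.queue) >= low_water:
            continue
        # Hand-off: move enough items to roughly equalize the two queues.
        n = (len(src.queue) - len(dst.queue)) // 2
        if n <= 0:
            continue
        dst.queue.extend(src.queue[-n:])
        del src.queue[-n:]
        moves.append((src.name, dst.name, n))
    return moves
```

For example, with node `a` holding 12 items and node `b` holding 2, a pass at `now=0.0` only records that `a` is hot and moves nothing; a later pass at `now=1.5` moves 5 items from `a` to `b`, leaving 7 on each. In a real multi-node setting the scan over `nodes` would be replaced by whatever load-reporting mechanism the system has, and the granularity would presumably differ within a node versus across nodes.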