| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by belgesel 1956 days ago
	A question to parallel programming experts. Can we assume that a task that is well seperated to perform on single core is always better than multi-core? I meant using multiple instances working on single core vs. single instance with multi threads.

3 comments

jandrewrogers 1956 days ago

Data sharing between cores is the bane of parallel programming at every scale. Consequently, high-performance parallel systems are not multithreaded in any meaningful sense -- each thread operates almost exclusively on private local data. See also "thread-per-core" software architectures for maximizing absolute throughput, which has its origins in supercomputing.

However, this creates a new problem for any workload that is not embarrassingly parallel. If all data is private to a single thread/core then the workload can become very unevenly distributed across cores depending on the data they own, destroying efficient parallelism in another way. Unlike multithreading, quick adaptive load shedding to smooth out these hotspots can scale surprisingly well across a very large number of cores/nodes with little overhead for many workloads. This is how many massively parallel codes are written today for workloads that are not embarrassingly parallel.

Partitioning data across cores with out sharing is necessary to maximize throughput, and almost always better than multithreading, but insufficient. Fortunately, mitigating transient hotspots is a mostly solved problem.

link

electricshampo1 1956 days ago

For this load shedding in a single node, are you imagining something more than work stealing style approaches?

In a multi-node cooperative setting you need some way to transmit information that a given node is overloaded, some way to find nodes that have available capacity, and a low overhead way to shift the work over to them. If the work to be done depends solely on data that you have local to you, it seems silly to shift the data as well (depending on how big it is); this would only make sense when you have to combine data from a variety of nodes (which can be done on any other node, assuming that the current node is overloaded).

Probably not worth doing anything about short lived hotspots (<1s). I wonder what kind of granularity you have used in your systems (probably different for within node vs across nodes).

link

jandrewrogers 1956 days ago

Yes, something different than work-stealing on a single node. Instead of pulling work from other cores when idle, cores push the data that is attracting work to other cores. Even though shedding data that is attracting work is not the same as shedding work per se, the effect is equivalent in real systems if you can shed it with very low latency.

The last several setups I've seen worked this way. This has lower overhead and significantly less thread contention than work-stealing. The hotspot granularity was pretty coarse a decade ago -- on the order of a few seconds -- because the algorithms and techniques weren't as good and the overhead was relatively high. Modern versions have some tricks that allow them to shed load so quickly (millisecond range) and with so little overhead that it just looks really smooth to have it proactively running all the time. The system I am currently working with is designed to shed thousands of megabyte-scale data chunks per second without interfering with throughput but you typically need quite a bit less to keep things balanced.

The ideas were originally developed for large parallel clusters running workloads inherently prone to hotspotting. To your point is a more complex problem and operates at coarser time granularities, both out of necessity and because hotspots form more slowly anyway. It turns out it works even better when applied at the scale of a large server. The concept is pretty simple in the abstract, the challenge is arriving at an efficient implementation with few edge cases that also can provide strong ordering guarantees. Quite a few slightly broken but usable designs were put into production, including by myself, before people started to grok the design idioms for the single-node case.

An important point is that unlike work-stealing, you can't sprinkle this on top of a system that wasn't designed for it. Implementations lean heavily on schedule control mechanisms, which most software architectures don't provide.

link

gpderetta 1956 days ago

Sharing read only data is not an issue at all. Sharing data written by only one thread also scales very well: The single writer principle is well understood and critical for designing efficient multi-threaded applications.

link

jandrewrogers 1956 days ago

Interestingly, not sharing read-only data is a well-known optimization for parallel code -- it can famously produce a super-linear scaling effect. While counterintuitive, the reason is mundane.

The majority of high-performance code is limited by memory bandwidth, not CPU or I/O. Consequently, throughput is a function of the aggregate CPU cache efficiency of the entire system. If read-only data is aggressively shared across cores, it means a large part of the total CPU cache capacity of the system is being used to store the same data multiple times, thereby wasting memory bandwidth and reducing system throughput.

If you do not share any read-only data, it maximizes the amount of the workload resident in the various CPU caches. This can have a significant performance effect in real parallel systems.

link

imtringued 1956 days ago

The only way to achieve perfect linear scaling is to run N threads that never talk to each other, because a single thread program doesn't communicate with other threads.

Whether you run processes or threads shouldn't matter. As soon as you do communication, expect your program to get slower than what is theoretically possible. Multiprocess tends to have higher communication overhead (not always).

link

aparsons 1956 days ago

If you flip some NUMA switches so tasks on the other core don’t add overhead, I’m inclined to say yes. But then modern schedulers are also so low-overhead that this case really only applies if you have a hard real-time requirement - in which case dedicated hardware is the norm

link