by jillesvangurp
111 days ago
I have a 16 core M4 Max, and running at a fraction of its potential maximum speed is wasteful on a modern CPU like that.

Threading is hard, especially when threads share a lot of state. Memory management across threads that share data is hard too, and sharing is ideally minimized. What is optimal depends very much on the type of workload. Not all workloads are IO dependent or require sharing a lot of state.

Using threads for blocking IO on server requests was popular 20 years ago in e.g. Java. These days non-blocking IO is preferred for both single-threaded and multi-threaded systems. E.g. Elasticsearch uses threading and non-blocking IO across CPU cores and cluster nodes to provide horizontal scalability for indexing. It tends to stick to just one indexing thread per CPU core, of course, but it has additional thread pools and generally more threads than CPU cores in total.

A lot of workloads where the CPU is the bottleneck but that also do some IO benefit from threading, because other threads can make progress while one is waiting for IO. If the amount of context switching can be limited, that can be OK. For loads that are embarrassingly parallel, with little or no IO and very limited shared context, one thread per CPU core tends to be optimal. It's really when you start having more threads than cores that context switching becomes a factor. What's optimal there depends very much on how much shared state there is and whether you are IO or CPU limited.

In general, concurrency and parallelism tend to be harder in languages that predate the time when threading and multi-core CPUs became common and that lack good primitives for them. Python only recently started addressing the GIL obstacle, and a big motivation for creating Rust was just how hard it is to do this stuff in C/C++ without creating a lot of deadlocks, crash bugs, and security issues. It's not impossible with the right frameworks and a lot of skill and discipline, of course. But Rust is getting a well-deserved reputation for being efficient and safe for this kind of thing. Likewise, functional languages like Elixir are more naturally suited to running on systems with lots of CPUs and threads.
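To make the "one thread per CPU core" idea concrete, here is a minimal Python sketch. It is an illustration rather than anything from a real system: `split_into_chunks`, `sum_of_squares` and `parallel_sum_of_squares` are made-up names, and the sum of squares is only a stand-in for CPU-bound work with no shared state. One caveat, assumed rather than measured: on a CPython build with the GIL, threads will not run pure-Python CPU work in parallel, so `ProcessPoolExecutor` (same executor interface) would be the usual substitute there.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous slices of near-equal size."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    # Hypothetical stand-in for CPU-bound work that shares no state.
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, workers=None):
    # Default to one worker per logical core; os.cpu_count() may return None.
    workers = workers or os.cpu_count() or 1
    # One chunk per worker slot, so at most `workers` tasks run at once
    # and the pool is never oversubscribed relative to the core count.
    chunks = split_into_chunks(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Workers only read their own chunk; results are combined afterwards.
        return sum(pool.map(sum_of_squares, chunks))
```

Calling `parallel_sum_of_squares(list(range(10)), workers=3)` splits the input into three chunks and adds up the partial results. The only point where workers "meet" is the final `sum`, which is what keeps shared state to a minimum.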
|
To further muddy the waters: if your process is not bottlenecked on the CPU, a modern chip might be more efficient in terms of power draw (directly, and through secondary effects such as reduced cooling needs) when running at a fraction of its speed. Running at a low clock that is still fast enough not to become the bottleneck, instead of bursting to full speed for a bit and then waiting, can be optimal.
Of course there are a bunch of chip-specific optimisations here if you like complexity. Some chips are better off running all cores slowly, while others that can completely power down idle cores are better off running a few cores faster, to optimise power use while getting the same job done in the same amount of wall-clock time.
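A toy model may help show why the answer differs per chip. Everything here is an assumption made for illustration: the function name `energy_joules`, the rule of thumb that dynamic power grows roughly with the cube of clock frequency, and all parameter values are hypothetical, not measurements of any real CPU.

```python
def energy_joules(work_cycles, freq_hz, window_s, k_dynamic, static_w, idle_w):
    """Energy used to finish a fixed amount of work within a fixed time window.

    Toy model (all assumptions): while busy, the core draws
    k_dynamic * freq_hz**3 + static_w watts; for the rest of the
    window it draws idle_w watts.
    """
    busy_s = work_cycles / freq_hz
    if busy_s > window_s:
        raise ValueError("clock too slow to finish the work inside the window")
    active_w = k_dynamic * freq_hz ** 3 + static_w
    return active_w * busy_s + idle_w * (window_s - busy_s)


WORK = 1e9    # hypothetical: cycles of work per window
WINDOW = 1.0  # hypothetical: seconds available
K = 1e-27     # hypothetical: gives about 1 W of dynamic power at 1 GHz


def compare(static_w, idle_w):
    # "slow" runs at 1 GHz for the whole window; "burst" runs at 2 GHz, then idles.
    slow = energy_joules(WORK, 1e9, WINDOW, K, static_w, idle_w)
    burst = energy_joules(WORK, 2e9, WINDOW, K, static_w, idle_w)
    return slow, burst
```

Under these assumptions, a chip with high static power that can fully power down when idle (`compare(10.0, 0.0)`) comes out ahead with the burst, while a chip that keeps drawing the same power when idle (`compare(10.0, 10.0)`) comes out ahead running slowly. The wall-clock window is identical in both cases.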