| It's not wrong, it's just over-simplified. It is true that you can only squeeze 100% of the maximum possible useful compute out of a NUMA system with methods like the article author was suggesting. The less coordination there is between cores, the less cross-core or cross-socket communication is needed, all of which is overhead. Caveat: If a bunch of independent processes are processing independent data, they'll increase cache thrashing at L2 and higher levels. Synchronised threads running the same code more-or-less in lockstep over the same areas of the data can benefit from sharing that independent processes can't. In some scenarios, this can be a huge speedup -- just ask a GPU programmer! Where the process-per-core argument definitely stops being a good approach is when you start to consider latency. Literally just this week, I need to help someone working on a Node.js app that needs to pre-cache a bunch of very expensive computations (map tiles over data changing on an interval). Because this is CPU-heavy and Node.js is single-threaded, it kills the user experience while it is running. Interactive responses get interleaved with batch actions, and users complain. This is not a problem with ASP.NET where this kind of work can simply run in a background thread and populate the cache without interfering with user queries! For similar reasons, Redis replacements that use multi-threading have far lower tail latencies: https://microsoft.github.io/garnet/ |