Hacker News new | ask | show | jobs
by bee_rider 956 days ago
The NUMA nature of recent* chips has made me wonder if there’s ever going to be a movement to start using message passing libraries (like MPI) on shared memory machines.

* actually, not even that recent, Zen planted this hope in my brain.

7 comments

Thread-per-core software architectures are doing this https://penberg.org/papers/tpc-ancs19.pdf

Real world examples are scylladb and Redpanda, both built on the seastar framework (C++ https://seastar.io/message-passing/).

And for rust there is glommio https://www.datadoghq.com/blog/engineering/introducing-glomm...

There is also another thread-per-core implementation by ByteDance (TikTok) for Rust called Monoio with benchmarks[0] comparing it to Tokio and Glommio.

[0] https://github.com/bytedance/monoio/blob/master/docs/en/benc...

Does thread per core necessarily imply message passing? I don't see why the two need to be related.
The thread-per-core manifesto has a goal of not sharing data between cores, and thus the communication inside the process becomes message passing, handing off ownership of a chunk of data to the recipient core. This lack of sharing is what enables the performance (no locks etc needed, outside of the message passing).

This is a good watch (first half is pure background, second half talks about the motivation): https://www.youtube.com/watch?v=PbgTyCSDPrs

In HPC it's common to do a mix of MPI (message-passing / distributed memory) and OpenMP (shared memory) parallelism when running on big multicore (and obviously multi-node) machines. It helps with locality, among other things.
This is what I do actually, and it works fairly well. Currently I do one MPI process per socket, but mostly just because the OpenMP code I’m calling is a library, and it doesn’t seem to scale well past one modern Xeon worth of cores.

I don’t know what I’d do if I had an old Zen machine, maybe map an MPI process to each chiplett.

My impression is that in the first generation Zen machines, the cost of communicating from one chiplett to another was really quite significant, but they’ve made good enough progress there that it is only something that the really hardcode folks care about.

Nope, Parallela was the wrong thing at the time and it's still wrong. Cache is good.
> Nope, Parallela was the wrong thing at the time and it's still wrong.

Can you elaborate?

It didn't have DRAM or caches. Programming with scratchpads is so difficult that people just give up.
Scratchpads are the memory equivalent of VLIW.

Processors can extract parallelism dynamically at runtime. They can also manage your memory automatically at run time. Better yet, they can utilize hardware resources instead of software resources. It is such an obvious win.

Are you saying that scratchpads were like VLIW in the sense that they seemed like a really cool idea, but failed because people didn’t want to manually code things well enough to take advantage of them?
Absolute statements are bad.
A good recent paper on implementing message passing over shared memory:

"Message Passing or Shared Memory: Evaluating the Delegation Abstraction for Multicores"

https://cs.brown.edu/~irina/papers/2013-opodis.pdf

I don't know where you got that idea from. There is a movement in the complete opposite direction with CXL. Don't waste your time with silly libraries, serialisation or networking. Have a rack that is filled with nothing but memory pooled RAM and then connect your servers (which still retain RAM as a L4 cache). You now have a huge shared memory machine with distributed CPUs using CXL for cache coherence accross the entire system. There have been benchmarks that kept 75% of the memory outside the server and the performance degradation was only 10% compared to keeping the entire data set on a single server.
> There have been benchmarks that kept 75% of the memory outside the server and the performance degradation was only 10% compared to keeping the entire data set on a single server.

Performance degradation would greatly depend on how much data was actually touched by the workload outside the server and not solely by the fact that 75% of the memory was attached through CXL, no?

NUMA latency I measured last time on a dual-socket Xeon (Haswell) system was around 130ns for non-local memory access and 90ns for local memory access. OTOH some numbers I found seem to imply that the CXL latency is ~200ns.

This means that on average CXL latency is almost 100% larger than NUMA so I think it is not realistic to have only 10% performance degradation unless most of your workload fits into L1/L2/L3 cache plus that 25% of local memory or your workload is more CPU bound rather than memory bound.

I keep thinking that Rust’s borrow semantics would be pretty good for hinting whether code should run on the same core or could be offloaded to another. Two modules that only communicate via small, read only messages could easily be on separate cores.

And on architectures where some cores share faster paths than others, gradations could be scheduled that way.

IMO, MPI is the wrong level to do this on. Most apps should either be using some form of mapreduce or not using parallelism beyond the numa node.
The map-reduce programming model is overly simplified. It cannot express useful primitives such as prefix scan, which is used all the time in parallel algorithms.