If you're looking for weird synchronization primitives, look at the documentation of the DMA controller. It has a mode in which bytes written to a particular address are stored into a memory range in the order the writes arrive. I haven't figured out a reasonable way to use that with multiple writers (except the trivial case of a byte-based stream with bounded size), though.
Yeah, I was thinking about that problem too. (It's not safe to blindly write somewhere unless you can be sure that nobody else is going to simultaneously clobber your data. You can't do any kind of atomic test-and-set or compare-and-swap operation on remote memory, so you don't have the usual building blocks for things like queues or semaphores.)
The problem becomes a lot easier if you can reduce the multiple-writer case to the single-writer case. One idea that occurred to me is that since you have 1024 cores, it might make sense to dedicate a small fraction of them (say, 1/64) to synchronization. When you need to send a message to another process, you write to a nearby "router" that has a dedicated buffer to receive your data. The router can then serialize the message with respect to other messages and put it into the receiver's buffer.
Basically, you'd end up defining an "overlay network" on top of the native hardware support; you pay a latency cost, but you gain a lot of flexibility.
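To make the router idea concrete, here is a toy Python model (not code for the actual chip). The names `Router`, `post` and `drain` are made up for illustration, and it assumes each receiver buffer is served by exactly one router, so every buffer in the system has a single writer.

```python
from collections import deque


class Router:
    """Hypothetical router core: one private inbox per sender."""

    def __init__(self, senders):
        # Each inbox is written by exactly one sender, so posting
        # needs no test-and-set or compare-and-swap.
        self.inboxes = {s: deque() for s in senders}

    def post(self, sender, dest, payload):
        # Stands in for the sender's plain remote write into its own inbox.
        self.inboxes[sender].append((dest, payload))

    def drain(self, receivers):
        # Assumption: this router is the only writer to these receiver
        # buffers, so it alone decides the order messages land in.
        for sender, inbox in self.inboxes.items():
            while inbox:
                dest, payload = inbox.popleft()
                receivers[dest].append((sender, payload))


receivers = {"r0": []}
router = Router(senders=["a", "b"])
router.post("a", "r0", 1)
router.post("b", "r0", 2)
router.post("a", "r0", 3)
router.drain(receivers)
```

This simple `drain` empties one inbox completely before moving to the next, so ordering is preserved per sender but not globally; a round-robin scan would be fairer at the cost of a bit more bookkeeping.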
EDIT: I may be completely wrong about the first paragraph; it looks like the TESTSET instruction might actually be usable on remote addresses. I assumed it wasn't, because the architecture documentation doesn't say anything about how such a capability would be implemented. But if it works, it would drastically simplify inter-node communication.
IIRC TESTSET is usable on remote addresses: it just sends a message that causes the test-and-set to happen, but you don't learn whether the test succeeded.
I was talking about the DMA mode in which every write to a special register (which may come from a different core) gets "redirected" to the next byte of the DMA target region. This can work as a queue with multiple enqueuers, but it has a bounded size (once the region is exhausted, messages get lost) and operates on single-byte messages.
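As a rough illustration of those semantics, here is a toy Python model of the behaviour as described above. It is not taken from the hardware manual: `ByteCaptureRegion`, `remote_write` and the `dropped` counter are invented names, and the counter exists only in the model, since a real writer would presumably get no indication that its byte was lost.

```python
class ByteCaptureRegion:
    """Toy model: writes to one address land in consecutive bytes."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.next = 0      # index where the next arriving byte is stored
        self.dropped = 0   # model-only bookkeeping, not a hardware feature

    def remote_write(self, byte):
        # Any core may call this; arrival order decides placement.
        if self.next >= len(self.buf):
            self.dropped += 1  # region exhausted: the message is lost
            return
        self.buf[self.next] = byte & 0xFF
        self.next += 1

    def received(self):
        # Bytes captured so far, in arrival order.
        return bytes(self.buf[:self.next])


q = ByteCaptureRegion(size=4)
for b in (1, 2, 3, 4, 5):
    q.remote_write(b)
```

After the loop, the first four bytes are captured in arrival order and the fifth is gone, which is exactly the "bounded, single-byte, lossy" limitation: writers need some out-of-band way to know the region has space before sending.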