| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by psobot 4921 days ago
	The actor model and data immutability are good for programmer productivity and reducing errors, but not for writing highly performant or low-level code. Even then, someone has to write the message queue, and how do you plan do to that without any mutable shared state?

2 comments

klodolph 4920 days ago

There's an enormous assumption here -- the assumption that we are on a shared memory computer to begin with. At the lowest level, memory devices are implemented using "messages" that go over wires, and locks are implemented on top of that.

But it's a feature of small systems that you THINK you have a flat address space and not a message passing system. CPU cores send messages to L1 caches to fetch memory, and the messages go outward to L2 and RAM, all in hardware. With a modern OS kernel this message will often be intercepted and fulfilled with disk IO.

So if you pretend you have a flat address space you won't be miserable or anything, because the entire system (hardware + software) is designed to make it look that way.

Once you get interested in performance details the flat memory space abstraction goes out the window. L1, L2, RAM, and disk latencies are really message passing latencies for messages passed between physically separate devices. Designing your algorithm to minimize the number of messages passed between devices (a.k.a. the number of "cache misses") will improve its performance. Optimizations meant to minimize message passing can have unintended consequences. For example, Linux once allocated memory on the region of RAM closest to the core running the process which requested the memory -- I'm sure it made some benchmarks run faster, but it turned into an absolute disaster for MySQL, because it meant that MySQL would only use one region of RAM. (Google "MySQL NUMA" for the full story. You can probably imagine most of it if you just think about the consequences of mixing on-die memory controllers with multi-socket systems.)

The degree to which the flat memory abstraction applies is related to the size of the system. An single-core, embedded microcontroller with 256 bytes of RAM really does have a flat address space and you can often count processor cycles just by looking at the assembly, the abstraction is basically perfect. A modern desktop with a quad-core processor is going to act less like an ideal shared memory machine because the message passing overhead of using contested cache lines can affect performance. And a world-class super computer might be split into N units of M cards of K chips with L cores on each chip -- the shared memory abstraction will only extend so far before it's all explicit message passing in software.

So the real question is not "how do you plan to write a queue without mutable shared state?" The real question is, "How do you implement mutable shared state using messages?"

Footnote: I think it's very telling... the shared memory abstraction is so convenient, the work behind the scenes is so good, we almost want to argue and say "Look, it's shared state, and everybody knows that it's shared state." Buddy, it's one fine illusion.

link

megrimlock 4920 days ago

Thanks for this insightful comment.

I have struggled to find the right ways to think about the phenomena underlying the shared memory abstractions. The limited visibility makes this harder than understanding other aspects of software. You can run instrumented simulations like cachegrind; you can query arcane CPU performance counters (Intel's Nehalem optimization guide was eye-opening regarding how much goes on inside vs. what you might learn in an undergrad CPU design class) -- but at the end of the day you have to be guided by experimentation and measurement. And so many times those leave us with no good theories to explain the observed behavior.

(War story, feel free to skip: One time we were trying to speed up a datafile parser by any means possible -- which was already split into a producer-consumer thread pair, one thread running the unfortunately complex grammar and producing a stream of mutation commands, the other thread consuming that and building up in-memory state. The engineer working on this found that adding NOPs could speed this up, and he measured and charted a range of # of NOPs and chose the best one. Our best guess was "something to do with memory layout?" The outcome of the story was that we tore out a bunch of abstraction layers and ended up with a simpler, single-threaded parser that didn't need such heroic and bizarre efforts, but it also left us feeling a bit of vertigo with regard to memory hierarchy behavior.)

Your pointing out that the shared memory abstraction is backed by message-passing between hardware components (which each represent concurrent processes) is really interesting - thanks!

link

klodolph 4920 days ago

I guess this is getting downvoted, because... it detracts from the discussion, or something.

link

tdrd 4920 days ago

Obviously there's no avoiding this, but the author writes in the context of high-level code where OS-provided locking mechanisms are available. Why are we discussing this in a low-level (or embedded, where every ounce of performance matters) context?

link

psobot 4920 days ago

You don't need to be embedded to have performance matter - even if your OS gives you concurrency primitives, there are many situations where jumping into kernel code is still "too expensive."

link

justincormack 4920 days ago

That's why Linux has the futex which is userspace only if not contended only jumps to kernel code for the contended case.

link

pcwalton 4920 days ago

There are many situations in which OS-level primitives such as blocking semaphores are available but userland primitives are not. When implementing libc, for example. Or when implementing a language runtime, or TBB, or boost::thread_pool, or what have you.

link

ucee054 4920 days ago

Because the author works in embedded

link