Hacker News new | ask | show | jobs
by javert 4866 days ago
So, this post has a number of errors, and is fundamentally wrong.

(a) pthread_mutex_t and friends use futexes, which only call into the kernel when there actually is contention.

(b) it would be better to use chrt (change to real-time priority) than the maxcpus trick, because the former accomplishes the same thing, but allows the core to still be used if the high-priority thread suspends (e.g. to do disk or network I/o).

(c) Contrary to his claim about Snort, there is no reason to prefer a multiprocess design over a multithreaded design for a particular application. There is no savings in overhead or synchronization or anything like that by going with processes. In fact, using processes and then using memory mapping to share when you could use threads, is just making things harder for yourself for no reason.

(d) What I’m trying to show you here is that “multi-core” doesn’t automatically mean “multi-threaded”. Well, in computer science terminology, a thread is a schedulable entity, and a process is a schedulable entity with memory protection. So, he's wrong. Lots of developers talk about threads and processes as orthogonal things, though, so I can see why he made that claim.

(e) The overall theme of my talk was to impress upon the audience that in order to create scalable application, you need to move your code out of the operating system kernel. You need to code everything yourself instead of letting the kernel do the heavy lifting for you. That is horrible advice that is just going to lead to lots of bugs and wasted effort. It's premature optimization. Even most people using the Linux realtime preemption patch (PREEMPT_RT) do not have such strict requirements that they need to take this advice.

(f) Your basic design should be one thread per core and lock-free synchronization that never causes a thread to stop and wait. Might apply to certain very specific real-time (as in, embedded systems or HFT) scenarios, but in general, no, you're just wasting the core when that one thread doesn't need to use it. Prefer real-time priorities if you really need it.

(g) Specifically, I’ve tried to drill into you the idea that what people call “multi-threaded” coding is not the same as “multi-core”. Multi-threaded techniques, like mutexes, don’t scale on multi-core. Again, you can only use multiple cores in parallel if there are multiple threads. And multi-threaded techniques do scale. You definitely may want to use lock-free synchronization instead of mutexes in some specialized scenarios, though.

EDIT: OK, here is one other thing I forgot in the list above.

(h) There are multiple types of locks, like spinlocks, mutexes, critical sections, semaphores, and so on. Even among these classes there are many variations. Technically, mutexes and semaphores both are ways of protecting critical sections, and a spinlock is a way of implementing a lock (including, possibly, a mutex or semaphore lock). Again, this is to some degree the difference between developers with a shared lingo and computer scientists. But if you go by that kind of lingo, you're missing part of the picture.

4 comments

Agreed.

I tried to benchmark the context switching of user-level threads vs kernel-level threads [0]. Specifically, libtask vs pthread. I assumed, pthread_yield would be very expensive, because it has to do a syscall into the kernel. Something like an order of magnitude or two. Interestingly, with two threads switching back and forth was only two times as slow as the user-level threads. Linux futexes seem to be really clever. Use them!

On the other hand, pthread_yield scales linearly, while libtask has a constant time yield. So for concurrency without parallelism [1] and lots of threads, the libtask library can be recommended.

[0] https://bitbucket.org/qznc/switchcost/src

[1] http://existentialtype.wordpress.com/2011/03/17/parallelism-...

(a) What part of "In the Linux pthread_mutex_t, when code stops and waits, it does a system call to return back to the kernel" do you not understand? That's how "futex" works: when it fails to get a lock, it must stop and wait, and therefore does a system call. When it doesn't have to wait because there is no contention, then it doesn't make a system call.

(b) The entire point is that for really scalable network applications, you don't want the high-priority thread to suspend for things like disk IO. Really, you read all that and expected the thread to use blocking IO instead of asynchronous IO???

(c) The point wasn't that Snort's multi-process design is better, only that it's acceptable. It's a single-threaded design written in the 1990s that has lots of quirks that make it hard to convert to multi-threaded operation. The point is that you can still get it to multi-core without having to make it multi-threaded.

(d) How is "multi-core doesn't mean multi-threaded"??? Snort is a multi-core app today and isn't multi-threaded.

(e) I keep repeating the claim because my code written in user mode scales better than the code of people who disagree. My code scales to 20-million connections, 20-gbps, 20-million packets/second. What does your code scale to?

(f) "Real-time" is different than the "network" apps I'm talking about. The entire idea is "control plane" vs. "data plane" separation. Control operations that need real-time guarantees are very different than high-throughput data operations.

(g) Multi-threaded techniques that try to interleave multiple threads on a single core suck for high network throughput. Just ask Apache

A lot of what you're saying hinges on being specific to what you call "really scalable network applications," but you never said anything about that in your post.

(d) How is "multi-core doesn't mean multi-threaded"??? Snort is a multi-core app today and isn't multi-threaded.

This is, again, between whether you want to be "computer science correct," or use developer lingo that is not even necessarily consistent among all developer and OS subcultures. I think saying it the way you are saying it, instead of "You could use multiple processes to get parallelism, instead of multiple threads," is misleading.

What does your code scale to?

Well, I'm a real-time guy, so my thread scales down to microsecond overheads.

Did you change the text in your post that explains pthread_mutex_t? If not, I probably made a mistake to pick on that, because what you have there is pretty clear.

After reading this I am wondering if there have been any measurements taken of the time 'wasted' in the kernel. You make it sound like there's a lot less wasted than is claimed by the author.

I'm not convinced that the article has it exactly wrong. There are obvious problems, like all the extra moving parts your code has. But what about runtimes that multiplex their threads (or what have you) on to one host thread + scheduler per core? It seems to fit the description, and it has been taken up by Go and Rust and probably others (what is Erlang's multicore story these days?) It also avoids the obvious problem, since the extra moving parts are in the runtime, not your application (much like they were in the kernel, not your application in the standard multi-threaded case.)

I still don't know if it saves you much. IIUC, the choice to multiplex is motivated by other reasons.

If you want to program using a ton of cheap threads (like in Go or Erlang or something), you pretty much have to do it in the runtime (i.e., use user-level threads), because you will waste a ton of time in the kernel on initialization and context switching, and you will waste a lot of memory.

In general, if you're using any other programming paradigm (i.e., 90+% of all software), you don't need tons of cheap threads; your application probably only needs at most one per core. In that case, you aren't constantly context switching (except to legitimately mutlitask with other programs), and you aren't using up that much kernel memory to hold thread state (because you only have a few threads, not a ton). So, you should just use the kernel as it was intended.

My assumption in reading the article is that the author was very much talking about the 90+% case (he didn't really say what he was talking about in the article).

There actually is another case though, which I think is what the author is really getting at. What if you have a small number of threads that need to be extremely high performance, and they have extremely short critical sections (which is not necessarily the common case across most applications)? Then, you would not want your threads to constantly suspend in the kernel every time there is a tiny bit of contention. You'd rather have them spin for a few cycles and just wait until they can actually get the lock... or just use hardware features (like the compare and exchange instructions) to abstract away the issue.

To do either of those, yeah, you pretty much need to code it yourself or (better) get a third party library. AFAIK. You would think pthreads would have a spinlock option that can never suspend (unlike futex, which can suspend), but if it does, I'm not aware of it.

In fact, formally speaking, there are lock-free algorithms, which don't have a lock, but can have unbounded retried before they get to access the data. And then you have wait-free algorithms, which can guarantee you get access to the data within a bounded number of retries.

Yeah, I get the different thread usage scenarios, but I'm still wondering about measurements. I'm less concerned about getting enough performance than I am about wasted performance, which is why I am thinking about measurements. If it's 10% that's something, if it's 0.0000000001% then who cares?

I also wonder about that 90+% case. Is it really 1:1 threads:cores at most? In my mind and experience, the architecture of the program (and of course the nature of the problem) should have a greater impact on the number of threads than leaving it at "are we using Go/Rust/Erlang vs. C/C++/Java." (Most recently I had several hundred coroutines on a single thread because the problem demanded it. In my world, the contended resource is never the CPU.) I'm wondering whether how much of the 90+% is really 1:1 and of that, how much really ought to be. Few problems are "as concurrent as the number of processors in the system" in the abstract. This is more of a question of nice design than anything else, but that's worth something (and the same reason we shouldn't needlessly include a scheduler in our programs.)

Another point is that most of these high-concurrency languages aim to satisfy that same 90+% in addition. In doing so, I wonder how much trouble they save (if any, given the runtime overhead, though within the runtime the scheduler itself had better be better than defaulting to the OS scheduler.)

But yes, I absolutely agree that you shouldn't be writing a scheduler in your application just to eek out a little bit of CPU overhead.

Hm... (c) needs a little qualification. Certainly in a multithreaded environment it's possible to end up in a cache contention situation accidentally, simply because you or the compiler happened to put two "write often from a single thread" values in the same cache line. Think of things like global/singleton objects which doesn't really need to be shared but are. A multiprocess design that limited shared memory to stuff that's truly shared won't have that pitfall.
No, that's not right.

Cache conflicts happen between processes in the same way, and for the same reasons, and just as often, as they do between threads.

No, you misunderstand. A global Java object (or whatever) in one process will be shared across all threads. If it has fields that get written "often" they are likely to end up thrashing the L1/L2 caches with snoop ejections. Often those writes are essentially spurious (global statistics, "current" thingy pointers) and don't really need to be shared across all threads. This really happens, and it's very easy to do.

An essentially identical architecture that defined the same object within a single address space is, quite clearly, not going to be subject to that bug.

An essentially identical architecture that defined the same object within a single address space

Well, then it would be the same architecture, because the one you described (one process with multiple threads) is also a single address space.

Still, I think you're wrong. The cache system doesn't have any awareness of threads or processes. So, you get spurious evictions all the time from completely unrelated programs, no matter what you do. Or from unrelated data in the same program that just happens to map to the same cache line.

If you still disagree with me and you can give me a specific example, I'd really appreciate it, because if I am misunderstanding something, it's a big deal, and I want to double check myself.

The cache system isn't aware of threads or processes, but it's clearly aware of address spaces. Make a big global Foo object in a single process and spawn a hundred threads to write to it. They'll be contending on the same cache line. You're with me that far, right?

Now take the same architecture and spawn a hundred processes. Those Foo objects now live in different physical pages and thus writes to them from all the processes live happily in L1 cache.

Obviously not all architectures work like this. If Foo is "really" shared, then nothing can help the contention. But usually it's not, it's just that the code was written by someone who didn't think about cache contention. That kind of performance bug is really easy to write. And a reasonable fix for it is "don't use multithreaded architectures when you don't need them".

The cache system isn't aware of threads or processes, but it's clearly aware of address spaces.

No, it's definitely not aware of address spaces. Caching is based on the physical address.

In fact, it's based on the low-order bits of the physical address. So, things that are in different physical pages but have the same lower-order bits can conflict.