Hacker News new | ask | show | jobs
by smosher 4862 days ago
After reading this I am wondering if there have been any measurements taken of the time 'wasted' in the kernel. You make it sound like there's a lot less wasted than is claimed by the author.

I'm not convinced that the article has it exactly wrong. There are obvious problems, like all the extra moving parts your code has. But what about runtimes that multiplex their threads (or what have you) on to one host thread + scheduler per core? It seems to fit the description, and it has been taken up by Go and Rust and probably others (what is Erlang's multicore story these days?) It also avoids the obvious problem, since the extra moving parts are in the runtime, not your application (much like they were in the kernel, not your application in the standard multi-threaded case.)

I still don't know if it saves you much. IIUC, the choice to multiplex is motivated by other reasons.

1 comments

If you want to program using a ton of cheap threads (like in Go or Erlang or something), you pretty much have to do it in the runtime (i.e., use user-level threads), because you will waste a ton of time in the kernel on initialization and context switching, and you will waste a lot of memory.

In general, if you're using any other programming paradigm (i.e., 90+% of all software), you don't need tons of cheap threads; your application probably only needs at most one per core. In that case, you aren't constantly context switching (except to legitimately mutlitask with other programs), and you aren't using up that much kernel memory to hold thread state (because you only have a few threads, not a ton). So, you should just use the kernel as it was intended.

My assumption in reading the article is that the author was very much talking about the 90+% case (he didn't really say what he was talking about in the article).

There actually is another case though, which I think is what the author is really getting at. What if you have a small number of threads that need to be extremely high performance, and they have extremely short critical sections (which is not necessarily the common case across most applications)? Then, you would not want your threads to constantly suspend in the kernel every time there is a tiny bit of contention. You'd rather have them spin for a few cycles and just wait until they can actually get the lock... or just use hardware features (like the compare and exchange instructions) to abstract away the issue.

To do either of those, yeah, you pretty much need to code it yourself or (better) get a third party library. AFAIK. You would think pthreads would have a spinlock option that can never suspend (unlike futex, which can suspend), but if it does, I'm not aware of it.

In fact, formally speaking, there are lock-free algorithms, which don't have a lock, but can have unbounded retried before they get to access the data. And then you have wait-free algorithms, which can guarantee you get access to the data within a bounded number of retries.

Yeah, I get the different thread usage scenarios, but I'm still wondering about measurements. I'm less concerned about getting enough performance than I am about wasted performance, which is why I am thinking about measurements. If it's 10% that's something, if it's 0.0000000001% then who cares?

I also wonder about that 90+% case. Is it really 1:1 threads:cores at most? In my mind and experience, the architecture of the program (and of course the nature of the problem) should have a greater impact on the number of threads than leaving it at "are we using Go/Rust/Erlang vs. C/C++/Java." (Most recently I had several hundred coroutines on a single thread because the problem demanded it. In my world, the contended resource is never the CPU.) I'm wondering whether how much of the 90+% is really 1:1 and of that, how much really ought to be. Few problems are "as concurrent as the number of processors in the system" in the abstract. This is more of a question of nice design than anything else, but that's worth something (and the same reason we shouldn't needlessly include a scheduler in our programs.)

Another point is that most of these high-concurrency languages aim to satisfy that same 90+% in addition. In doing so, I wonder how much trouble they save (if any, given the runtime overhead, though within the runtime the scheduler itself had better be better than defaulting to the OS scheduler.)

But yes, I absolutely agree that you shouldn't be writing a scheduler in your application just to eek out a little bit of CPU overhead.