by pcwalton 3148 days ago
It doesn't really "cut through" the debate any more than any other implementation of threads does. The only difference between Go and plain old one-thread-per-connection is that regular threads run in the kernel, while Go threads run in userspace. That's not a semantic difference, only an implementation detail (a large detail, to be clear, but still an implementation detail).

There were historical implementations of pthreads, such as NGPT, that used precisely the same model as Go, and they were abandoned because the advantages over 1:1 were not sufficient to justify the complexity.

What you call a "Go thread" has a precise name (goroutine) and running in userspace is hardly the only difference between a goroutine and a kernel thread.

Creating and destroying kernel threads is significantly more expensive.

A kernel thread has a fixed stack and if you go beyond, you crash. Which means that you have to create kernel threads with worst-case-scenario stack sizes (and pray that you got it right).

A goroutine has an expandable stack and starts with a very small one (which is partly why it's faster to create; setting up kernel page mappings to create a contiguous space for a large stack is not free).

Finally, goroutine scheduling is different than kernel thread scheduling: a blocked goroutine consumes no CPU cycles.

In a 4 core CPU there is no point in running more than 4 busy kernel threads but kernel scheduler has to give each thread a chance to run. The more threads you have, the more time the kernel spends on the pointless work of ping-ponging between threads. That hurts throughput, especially when we're talking about high-load servers (serving thousands or even millions of concurrent connections).

Go runtime only creates as many threads as CPUs and avoids this waste.

That's why high-perf servers (like nginx) don't just use kernel thread per connection and go through considerable complexity of writing event driven code.

Go gives you straightforward programming model of thread-per-connection with scalability and performance much closer to event-driven model.
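To make the "thread-per-connection" style concrete, here is a minimal sketch, not taken from any project in this thread. The names `handle`, `serve`, and `echoOnce` are made up for illustration, and `echoOnce` uses an in-memory `net.Pipe` instead of a real socket so the example runs anywhere. Each connection gets its own goroutine that does ordinary blocking reads and writes.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
)

// handle echoes each line back to the peer using plain blocking reads
// and writes; it returns when the peer closes the connection.
func handle(conn net.Conn) {
	defer conn.Close()
	r := bufio.NewReader(conn)
	for {
		line, err := r.ReadString('\n')
		if err != nil {
			return
		}
		if _, err := conn.Write([]byte(line)); err != nil {
			return
		}
	}
}

// serve accepts connections and starts one goroutine per connection.
// With a real listener this would be called as serve(ln) after net.Listen.
func serve(ln net.Listener) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		go handle(conn)
	}
}

// echoOnce is a hypothetical demo helper: it runs handle on one end of
// an in-memory pipe and sends a single line through the other end.
func echoOnce(msg string) string {
	client, server := net.Pipe()
	go handle(server)
	defer client.Close()
	fmt.Fprintf(client, "%s\n", msg)
	reply, _ := bufio.NewReader(client).ReadString('\n')
	return reply
}

func main() {
	fmt.Print(echoOnce("hello"))
}
```

There is no event loop or callback in `handle`; whether the blocking calls are multiplexed onto epoll or similar is left to the runtime.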

You work on Rust and are well informed about this topic so I'm sure you know all of that.

Which is why it amazes me the lengths to which you go to denigrate Go in that respect and minimize what is a great and unique programming model among mainstream languages.

> What you call a "Go thread" has a precise name (goroutine)

I call goroutines threads because they are user-level threads.

As an analogy, NVIDIA calls local threadgroups "warps", but that doesn't make them not local threadgroups.

> Creating and destroying kernel threads is significantly more expensive.

Because kernel threads usually have larger stacks. But they don't always have large stacks: that is configurable. Other than the stack size, the primary difference is simply that kernel threads are created in kernel space and user threads are created in userspace.

> A kernel thread has a fixed stack and if you go beyond, you crash. Which means that you have to create kernel threads with worst-case-scenario stack sizes (and pray that you got it right).

You can do stack switching in 1:1 too. After all, if you couldn't, then Go couldn't do stack switching at all, since goroutines are built on top of kernel threads.

Go's small stacks are really a property of the moving GC, not a property of the threading model.

> In a 4 core CPU there is no point in running more than 4 busy kernel threads but kernel scheduler has to give each thread a chance to run.

> Go runtime only creates as many threads as CPUs and avoids this waste.

Not if they're blocked doing I/O!

If they're not blocked doing I/O, then Go tries to do preemption just as the kernel does. (I say "tries to" because Go currently cannot preempt outside function boundaries; this is a significant downside of M:N threading compared to 1:1 kernel threading.)

> That's why high-perf servers (like nginx) don't just use kernel thread per connection and go through considerable complexity of writing event driven code.

High-performance servers like nginx use an event loop because it's the only way to get the absolute fastest performance, with no overhead of stacks at all. The fact that the project described in the article gets better performance than Go's threads is proof of that.

It would be possible, and interesting, to do Go-like 1:1 threading with small stacks.

> Go gives you straightforward programming model of thread-per-connection with scalability and performance much closer to event-driven model.

Sure. But that's mostly because of the GC, not because of the M:N threading model.

> Which is why it amazes me the lengths to which you go to denigrate Go in that respect and minimize what is a great and unique programming model among mainstream languages.

It's not unique. As I said, NGPT used to do M:N for pthreads. Solaris used to do M:N for pthreads. The JVM used to do M:N.

Nope, the JVM used to do M:1, it's very different from M:N.

The goroutine implementation scales, while other thread implementations (by default) do not. That's a semantic difference. A Go server can have millions of active goroutines with moderate resource use.

You can achieve the same on Linux or Solaris using kernel threads, but you have to work at it. With Go you don't have to work at it, and it works on macOS and Windows and a few other OSs too.

This is all comparisons between O(1) things, but the constant factor matters.

> You can achieve the same on Linux or Solaris using kernel threads, but you have to work at it.

By setting the thread stack size to a reasonable value. That's it. And, in fact, on 64-bit you often don't even need to do that.

The difference you're describing is a difference in default thread stack sizes, which is hardly a paradigm shift. We're talking about one call to pthread_attr_setstacksize().

It's not nearly as simple as you claim.

First: if you have an epoll loop it is also the cost of the thread context switch, which has definitely hurt us in RPC systems using kernel threads. By contrast the goroutine gets scheduled onto the kernel thread that answered the poll, saving the switch.

Second: as I alluded to earlier, linux and solaris can scale their kernel thread implementations, not all OSs can. My experiences with large numbers of threads on the BSDs and Windows (in years past admittedly) suggest other kernels don't have thread implementations designed to scale to such high numbers. Solving the problem in userspace means Go programs written in this style are portable across operating systems.

Third: you can only adjust stack sizes down if you know your program always keeps its stacks small. If you depend on libraries you don't own in C/C++, that's a difficult assumption. Go grows the stacks, so if you hit some corner case where a small number of goroutines need some significant amount of stack, your program uses more memory, but typically keeps working. No need for careful (manual!) stack accounting.
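A small sketch of the stack-growth point, assuming a 64-bit platform and default runtime settings. `depthSum` and `runInGoroutine` are names invented for this example. The goroutine starts with a stack of only a few kilobytes, yet a recursion 100,000 frames deep completes because the runtime grows the stack as needed.

```go
package main

import "fmt"

// depthSum recurses n frames deep; the addition happens after the
// recursive call returns, so every level keeps a live stack frame.
func depthSum(n int64) int64 {
	if n == 0 {
		return 0
	}
	return n + depthSum(n-1)
}

// runInGoroutine runs depthSum on a fresh goroutine, which begins with
// a small stack, and returns the result once it finishes.
func runInGoroutine(n int64) int64 {
	done := make(chan int64)
	go func() { done <- depthSum(n) }()
	return <-done
}

func main() {
	fmt.Println(runInGoroutine(100000)) // prints 5000050000
}
```

Nothing in the program states a stack size up front, which is the "no manual stack accounting" argument in code form.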

If all this were as easy as you say, we would still write nearly all our C/C++ servers using threads. We don't because it's not.

> First: if you have an epoll loop it is also the cost of the thread context switch, which has definitely hurt us in RPC systems using kernel threads. By contrast the goroutine gets scheduled onto the kernel thread that answered the poll, saving the switch.

I'm not comparing M:N to a 1:1 system where all I/O is proxied out to another thread sitting in an epoll loop. I'm comparing M:N to 1:1 with blocking I/O. In this scenario, the kernel switches directly onto the appropriate thread.

> Second: as I alluded to earlier, linux and solaris can scale their kernel thread implementations, not all OSs can.

The vast majority of Go users are running Linux. And on Windows, UMS is 1:1 and is the preferred way to do high-performance servers; it avoids a lot of the problems that Go has (for instance, playing nicely with third-party code).

> Third: you can only adjust stack sizes down if you know your program always keeps its stacks small.

You could do 1:1 with stack growth just as Go does. As I've said before, small stacks are a property of the relocatable GC, not a property of the thread implementation.

> If all this were as easy as you say, we would still write nearly all our C/C++ servers using threads.

We don't write C/C++ servers using threads because (1) stackless use of epoll is faster than both 1:1 threading and M:N threading, as this project shows; (2) C/C++ can't do relocatable stacks, as the language is hostile to precise moving GC.

First, a point of curiosity: have you seen a Linux 1:1 system with blocking I/O scaled to millions of active threads? I have only ever seen it with epoll. My working assumption has been that the kernel blocking calls won't scale, but I have not tested that.

Second, almost all the event-driven C++ servers I have seen are written that way not for performance, but for scaling and latency. There is usually plenty of extra CPU and RAM, only a tiny fraction really bump up against resource limits. (A typical case of the vast majority of code not being performance sensitive.)

Otherwise, I agree with your points in this comment. Especially the broader point that there's no novel component of Go. Go is about combining well-known things together.

However, it seems to me that Go still cuts through the "threads vs. events" argument in a way nothing else does. I can write code in a blocking style using typical libraries, and have it scale to large numbers of active connections.

On other systems the implementations don't scale or I have to heavily restrict library use based on stack growth, or I am tied to a particular OS. It seems to me the only alternatives to Go's nice blocking code environment require significant compromise or require something to be built.

Choice of 1:1 or M:N is all about trade-offs. NPTL chose 1:1 for simplicity (and decided to focus instead on making context switches as cheap as possible in the Linux kernel). But that doesn’t mean M:N has no benefits - I think it does, as golang, erlang, and other languages illustrate.

I agree with OP that golang seems to provide the best of both worlds in the “event” vs “thread” debate. We can get the performance benefits of an eventing model with a much simpler programming model of thread per request.

It’s all “semantically” similar but it’s the details that matter. And I think golang chose the correct trade-offs here (and with their sub-ms GC as well). The JVM, as an opposing example, made all the wrong choices, I think, for the general use case: slow GCs and 1:1 threading.

I always understood the overhead of kernel threads compared to user threads to be significant at large scale. It’s not just stacks either. It can be a lot cheaper to swap between user threads, depending on implementation, compared to the scheduler having to preempt and trap into kernel code and provide a general purpose context switch.