by bsdetector 4698 days ago
> The most plausible explanation would amount essentially to a misconfigured library, not a fundamental advantage due to say, advanced JVM JIT.

Really, the most plausible explanation? I'd say the most plausible explanation is that M:N scheduling has always been bad at latency and fair scheduling. That's why everybody else abandoned it when that matters. It's basically only good for when fair and efficient scheduling doesn't matter, like maths for instance, which is why it's still used in Haskell and Rust. I wouldn't be surprised to see Rust at least abandon M:N soon though once they start really optimizing performance.


Interestingly, both the go client and the scala client perform at the same speed when talking to the scala server (~3.3s total), but the scala client is much faster when talking to the go server (~1.9s total), whereas the go client is much slower (~23s total, ~15s with GC disabled).

I thought the difference might partly be in socket buffering on the client, so I printed the size of the send and receive buffers on the socket in the scala client, and set them the same on the socket in the go client. This didn't actually bring the time down. Huh.
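For anyone wanting to repeat the buffer experiment, here is a minimal sketch of setting the send and receive buffer sizes on a Go TCP socket. The helper names `dialWithBuffers` and `demo`, the 64 KiB sizes, and the throwaway local listener are assumptions for illustration, not taken from the benchmark code. The kernel is free to round or clamp whatever is requested.

```go
package main

import (
	"fmt"
	"net"
)

// dialWithBuffers (hypothetical helper) connects to addr over TCP and requests
// the given send and receive buffer sizes in bytes. The OS may adjust them.
func dialWithBuffers(addr string, sndBuf, rcvBuf int) (*net.TCPConn, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	tcp, ok := conn.(*net.TCPConn)
	if !ok {
		conn.Close()
		return nil, fmt.Errorf("not a TCP connection: %T", conn)
	}
	if err := tcp.SetWriteBuffer(sndBuf); err != nil {
		tcp.Close()
		return nil, err
	}
	if err := tcp.SetReadBuffer(rcvBuf); err != nil {
		tcp.Close()
		return nil, err
	}
	return tcp, nil
}

// demo starts a local listener so the sketch runs without the benchmark server.
func demo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()
	go func() {
		// Accept a single connection and close it; errors are ignored here.
		if c, err := ln.Accept(); err == nil {
			c.Close()
		}
	}()

	// 64 KiB is an arbitrary example size, not the value used in the thread.
	conn, err := dialWithBuffers(ln.Addr().String(), 64*1024, 64*1024)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("connected with requested buffer sizes")
}
```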

My next thought was that scala is somehow being more parallel when it evaluates the futures in Await.result. Running `tcpdump -i lo tcp port 1201` seems to confirm this. The scala client has a lot more parallelism (judging by packet sequence ids). Is that really because go's internal scheduling of goroutines is causing lock contention or lots of context switching?
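To make the comparison concrete, this is a rough Go counterpart to awaiting a batch of futures: start every task in its own goroutine, then block until all have finished. It is only a sketch; `awaitAll` is a made-up name and the squaring task stands in for a network round trip, so it says nothing about how the benchmark client is actually written.

```go
package main

import (
	"fmt"
	"sync"
)

// awaitAll (hypothetical helper) runs task(i) for i in [0, n) in separate
// goroutines and blocks until all finish, returning results in index order.
func awaitAll(n int, task func(i int) int) []int {
	results := make([]int, n)
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func(i int) {
			defer wg.Done()
			results[i] = task(i) // each goroutine writes only its own slot
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	// The task here is a stand-in for one request/response exchange.
	squares := awaitAll(5, func(i int) int { return i * i })
	fmt.Println(squares) // prints [0 1 4 9 16]
}
```

How much of this actually overlaps on the wire is up to the runtime scheduler, which is the thing in question here.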

And...googling a bit, it looks like that is the case: https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sL...

> Current goroutine scheduler limits scalability of concurrent programs written in Go, in particular, high-throughput servers and parallel computational programs. Vtocc server maxes out at 70% CPU on 8-core box, while profile shows 14% is spent in runtime.futex(). In general, the scheduler may inhibit users from using idiomatic fine-grained concurrency where performance is critical.

Bear in mind that was written before Go 1.1. Additionally, Dmitry has taken steps to address CPU underutilization and has been working with the rest of the Go team on preemption. I think these improvements will make it into Go 1.2, fingers crossed.
Interesting, but now I'm even more confused. How can we possibly explain that a (go client -> go server) (which are in separate go processes) performs far worse than (go -> scala server), given that the go server seems to be better when using the scala client?

The comments on the article page have a different report which doesn't suffer from this implausibility:

go server + go client 22.02125152

scala server + scala client 3.469

go server + scala client 3.562

scala server + go client 4.766823392

> Interesting, but now I'm even more confused. How can we possibly explain that a (go client -> go server) (which are in separate go processes) performs far worse than (go -> scala server), given that the go server seems to be better when using the scala client?

I've been curious about that as well. The major slowdown seems to be related to a specific combination of go server and client. I don't have a good explanation. I'd love to hear from someone familiar with go internals.

> go server + go client 22.02125152
> ...
> scala server + go client 4.766823392

That's roughly equivalent to my numbers.

Best response here. I spent weeks trying to get a Go OpenFlow controller on par with Floodlight (Java). I finally gave up on TCP performance and moved on when I realized scheduling was the problem.
I'm curious: are you saying Go is M:N and JVM is not? I had to look up M:N - http://en.wikipedia.org/wiki/Thread_(computing)#M:N_.28Hybri... - but ultimately I don't know anything about JVM or Go threading, and your comment didn't go enough into detail for me to follow your reasoning.
Yes, I forgot the audience. Go uses M:N scheduling, meaning that the runtime multiplexes M of its own threads (goroutines) onto N OS threads. The JVM uses 1:1, like basically every other program, where each thread is an OS thread and the kernel does all scheduling.

The basic problem with M:N scheduling is that the OS and program work against each other because they have imperfect information, causing inefficiencies.
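A small sketch of what the multiplexing looks like from inside a Go program: many goroutines can be alive at once while the runtime runs Go code on at most `GOMAXPROCS` OS threads at a time. The helper name `parkGoroutines` and the count of 1000 are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parkGoroutines (hypothetical helper) starts n goroutines that block on a
// channel and reports how many goroutines exist while they are all parked.
func parkGoroutines(n int) int {
	stop := make(chan struct{})
	var started, finished sync.WaitGroup
	started.Add(n)
	finished.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer finished.Done()
			started.Done()
			<-stop // parked by the Go scheduler, not by the kernel
		}()
	}
	started.Wait() // every goroutine has started and is still alive
	alive := runtime.NumGoroutine()
	close(stop) // release all of them
	finished.Wait()
	return alive
}

func main() {
	alive := parkGoroutines(1000)
	fmt.Println("goroutines alive:", alive)
	// GOMAXPROCS(0) only queries the current setting.
	fmt.Println("OS threads running Go code at once (max):", runtime.GOMAXPROCS(0))
}
```

The kernel only sees the OS threads; which goroutine runs on which thread is decided in userland, which is where the imperfect information comes from.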

Yes, but can Go actually use anything else? Finely-grained concurrency after the CSP fashion, after all, is the whole driving force behind it, and it's in the language spec.
Are hybrid approaches worth it (exposing some details so that Go network server can get the right service from the OS)? I'm not sure how much language complexity Go-nuts will take, so they'll probably look for clever heuristic tweaks instead.
You can turn off M:N on a per-thread (really per-thread-group) basis in Rust and we've been doing that for a while in parts of Servo. For example, the script/layout threads really want to be a separate thread from GL compositing.

Userland scheduling is still nice for optimizing synchronous RPC-style message sends so that they can switch directly to the target task without a trip through the scheduler. It's also nice when you want to implement work stealing.
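For comparison, Go has a rough analog to opting a task out of M:N: `runtime.LockOSThread` wires the calling goroutine to its current OS thread until it is unlocked. The sketch below is not the Rust mechanism described above, and `runPinned` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"runtime"
)

// runPinned (hypothetical helper) runs work in a goroutine that is wired to a
// single OS thread for its whole lifetime, then waits for it to finish.
func runPinned(work func()) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		runtime.LockOSThread() // no other goroutine runs on this thread
		defer runtime.UnlockOSThread()
		work()
	}()
	<-done
}

func main() {
	runPinned(func() {
		fmt.Println("running on a dedicated OS thread")
	})
}
```

This is typically used for code that depends on thread-local state, such as GL contexts, which is similar in spirit to the compositing case.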

Can you just have 1 thread per running task and give the thread back to a pool when the task waits for messages? Then for synchronous RPC you can swap the server task onto the current thread without OS scheduling and swap it back when it's done. You just need a combined 'send response and get next message' operation so the server and client can be swapped back again. This seems way easier and more robust, and you don't need work stealing since each running task has its own thread... what am I missing?
It doesn't work if you want to optimistically switch to the receiving task, but keep the sending task around with some work that it might like to do if other CPUs become idle. (For example, we've thought about scheduling JS GCs this way while JS is blocked on layout.)
Is the OS not scheduling M runnable threads on N cores? Blocking/non-blocking is just an API distinction, and languages implement one in terms of the other.
Goroutines are not threads. You can have a dozen goroutines which would only run on a smaller subset of OS threads.
They are threads. Technically they are "green threads". The runtime does not map them one-to-one to OS threads, although it could if it chose to, because goroutines are abstract things and the mapping to real threads is a platform decision.