Hacker News
by openasocket 3199 days ago
How does this play with Go's scheduler? My understanding is that the Go scheduler is not preemptive: goroutines are switched out only at yield points, such as the start of a function body. So a tight loop that doesn't call other functions can effectively hog its OS thread until it exits the loop. (No idea what happens with FFI; maybe that's done in a separate thread pool?) For most things you'd use Go for, you aren't doing a lot of CPU-bound work, so that doesn't matter, but here you might run into some hiccups. I'm specifically thinking of a case where you use this library for heavy matrix operations as part of a web service, and those tight loops hog the OS threads and hurt your throughput and p90 latency.

My question to the developer: is that issue something you've encountered with this library? If not, did you design the library to periodically yield in tight loops, or am I just completely wrong about the Go scheduler?
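To make the question concrete, here is a minimal sketch of a tight loop that yields periodically. `runtime.Gosched` is the standard-library call that yields the processor to other goroutines; `sumSquares` and its `yieldEvery` parameter are hypothetical names for illustration, not part of this library. Note that since Go 1.14 the runtime also preempts long-running loops asynchronously on most platforms, so an explicit yield like this mainly matters on older releases.

```go
package main

import (
	"fmt"
	"runtime"
)

// sumSquares adds i*i for i in [0, n) in a tight loop. Every yieldEvery
// iterations it calls runtime.Gosched, giving the scheduler an explicit
// chance to run other goroutines. yieldEvery is a hypothetical tuning
// knob: larger values mean less overhead, smaller values mean shorter
// stalls for other goroutines. A value <= 0 disables yielding.
func sumSquares(n, yieldEvery int) uint64 {
	var total uint64
	for i := 0; i < n; i++ {
		total += uint64(i) * uint64(i)
		if yieldEvery > 0 && i%yieldEvery == yieldEvery-1 {
			runtime.Gosched() // explicit yield point inside the loop
		}
	}
	return total
}

func main() {
	// With a single P, the goroutine below can get a turn when
	// sumSquares yields (or, on Go 1.14+, when it is preempted).
	runtime.GOMAXPROCS(1)
	done := make(chan struct{})
	go func() {
		fmt.Println("other goroutine ran")
		close(done)
	}()
	fmt.Println(sumSquares(1000, 64)) // 332833500
	<-done
}
```

The trade-off is the usual one: yielding too often costs scheduler round-trips, yielding too rarely brings back the latency spikes described above.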


You're right that people have seen high p90 latency as a result of things like base64-encoding large blocks.

But one thing to remember is that Go inserts GC and preemption points at function call sites, so as long as a function is called occasionally, you're good.

Cgo threading does complicate matters. My understanding is that cgo calls are done in a thread pool with a larger stack size. I don't know the details of how that thread pool is managed. Not sure whether this helps or hurts with your concern.

Also, don't forget GOMAXPROCS. There's nothing stopping you from letting the Go runtime spin up an arbitrarily large number of OS threads.

So it's not an ideal situation, but if you're careful I don't think tight loops are likely to torpedo an otherwise sound Go project.

I don't use Gonum with a web server + large calculations, so I can't definitively answer. No one has reported problems, but that could be down to a lack of usage. One thing, though: matrix multiplication (which is a kernel for higher-level operations) is written in a blocked format, and the code can be preempted between any of those blocks, so I wouldn't expect it to be a problem.
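A minimal sketch of what a blocked layout looks like, assuming square row-major `float64` matrices. This is not Gonum's implementation: `mulBlocked`, `mulTile` and the tiny `blockSize` are hypothetical, and the sketch calls `runtime.Gosched` between output tiles to make the yield point explicit rather than relying on where the compiler places preemption checks.

```go
package main

import (
	"fmt"
	"runtime"
)

// blockSize is a hypothetical tile size; it is tiny here so the 3x3
// example exercises the ragged edge. Real code tunes it to the cache.
const blockSize = 2

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// mulTile adds the product of one tile of a (rows from i0, cols from k0)
// and one tile of b (rows from k0, cols from j0) into c, clamped at n.
func mulTile(a, b, c []float64, n, i0, j0, k0 int) {
	iMax, jMax, kMax := minInt(i0+blockSize, n), minInt(j0+blockSize, n), minInt(k0+blockSize, n)
	for i := i0; i < iMax; i++ {
		for k := k0; k < kMax; k++ {
			aik := a[i*n+k]
			for j := j0; j < jMax; j++ {
				c[i*n+j] += aik * b[k*n+j]
			}
		}
	}
}

// mulBlocked returns a*b for n×n row-major matrices, filling c one
// output tile at a time and yielding between tiles.
func mulBlocked(a, b []float64, n int) []float64 {
	c := make([]float64, n*n)
	for i0 := 0; i0 < n; i0 += blockSize {
		for j0 := 0; j0 < n; j0 += blockSize {
			for k0 := 0; k0 < n; k0 += blockSize {
				mulTile(a, b, c, n, i0, j0, k0)
			}
			runtime.Gosched() // explicit yield point between output tiles
		}
	}
	return c
}

func main() {
	a := []float64{1, 2, 3, 4, 5, 6, 7, 8, 9}
	b := []float64{9, 8, 7, 6, 5, 4, 3, 2, 1}
	fmt.Println(mulBlocked(a, b, 3)) // [30 24 18 84 69 54 138 114 90]
}
```

The point of the blocking for scheduling purposes is that no single uninterrupted stretch of work grows with the full matrix size; each tile is bounded by `blockSize`.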
Yeah, skimming your source it seems most of your loops involve calling some function, and even if that's inlined I believe the Go compiler will put a speculative yield call in there.

I suppose my hypothetical would be an issue if you used a non-Go BLAS implementation, as calling out to C will hog the OS thread. But this is a known issue (e.g. https://www.cockroachlabs.com/blog/the-cost-and-complexity-o...).

The solution to that is to write a C-batcher. Gorgonia optionally uses one: https://github.com/chewxy/gorgonia/tree/master/blase
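A minimal sketch of the batching idea, assuming (from the name and the linked package) that a C-batcher queues many small foreign calls and crosses into C once per batch, so the per-call cgo transition cost is paid per batch rather than per call. To keep the sketch runnable without a C toolchain, the `cross` callback is an ordinary Go function standing in for that single cgo call; `batcher`, `call` and `limit` are hypothetical names, not Gorgonia's API.

```go
package main

import "fmt"

// call is a hypothetical description of one small foreign-library call
// (for example one BLAS routine and its arguments), queued, not executed.
type call struct {
	op   string
	args []float64
}

// batcher queues calls and hands them to cross in one go. In a real
// C-batcher, cross would be a single cgo call that walks the batch on
// the C side; here it is plain Go so the sketch runs without cgo.
type batcher struct {
	pending   []call
	limit     int
	cross     func([]call)
	crossings int
}

// Add queues c and flushes once the batch reaches limit.
func (b *batcher) Add(c call) {
	b.pending = append(b.pending, c)
	if len(b.pending) >= b.limit {
		b.Flush()
	}
}

// Flush sends any pending calls across in a single crossing.
func (b *batcher) Flush() {
	if len(b.pending) == 0 {
		return
	}
	b.cross(b.pending)
	b.crossings++
	b.pending = nil // start a fresh slice so cross may keep the old one
}

func main() {
	executed := 0
	b := &batcher{
		limit: 4,
		cross: func(batch []call) {
			// Stand-in for one cgo call that would run every queued op in C.
			executed += len(batch)
		},
	}
	for i := 0; i < 10; i++ {
		b.Add(call{op: "scale", args: []float64{float64(i)}})
	}
	b.Flush() // push out the partial final batch
	fmt.Println("calls:", executed, "crossings:", b.crossings) // calls: 10 crossings: 3
}
```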

(also it's undergoing major reconstruction/refactoring right now)

I'm not sure I understand the p90 latency problem. The CPU is busy somewhere anyway, so even if you used another language, you wouldn't be able to serve a request while doing intense CPU work, would you?

The OS will pause it to give all threads some CPU time. The difference is that it's the OS doing the work of switching between threads, as opposed to the Go runtime pausing and switching goroutines. Keeping it all in Go is faster, but the Go runtime doesn't have the OS's ability to pause in the middle of a block of code, save its state, and prepare it to resume later.