Hacker News new | ask | show | jobs
by krobertson 4722 days ago
I hadn't looked at that issue before, but have definitely been bit by it.

On our main codebase, we experienced some major issues when moving from Go 1.0 to 1.1 from this exact issue. We had a goroutine that was doing some remote calls wrapped with a timeout, and the call was consistently timing out even though the remote service was perfectly fine (it was another service on the same box).

We found the cause was another goroutine running something in an for loop that didn't do anything that would allow a pause in execution for another goroutine. So, the scheduler just obsessed on that goroutine, running none of the others until that one was done, and by then the timeout on the remote call had expired.

We fixed up that case, but also found a few others were we simply added a very small sleep call... for no other reason than allowing the scheduler to evaluate other goroutines. Meh. It made sense when we finally tracked it down, but was one of those things where we had to pause and ask "really?"... and adding a sleep call with comments "yes, I am really calling sleep".

6 comments

While it would be nice if the programmer never had to worry about this situation in the first place, when confronted with something like this I use runtime.Gosched() rather than a sleep call to yield. It more directly performs what you're attempting to do and is much more clearly self-documenting for situations where you really don't need to sleep for whatever period of time but do need to yield.
I hadn't seen runtime.Gosched() before, will take another look at it. Mentioned it to a coworker and they already knew of it, so maybe it was always switched to call it instead of sleep. :)
I had what I think was a related problem with file descriptor retention in timeouted socket code when writing a simple portscanner in Golang; I had to implement simple time-based flow control to give the runtime time to collect file descriptors, or I'd run out of them.
If you use Go's time.Sleep(), the goroutine will yield to all other goroutines while it is sleeping. Unfortunately, as georgemcbay says, using the system's sleep impedes Go's runtime as it cannot use that thread to run any goroutines in that time.

But yeah, I would say that is one of the quirks of Go at this time. You definitely need to be aware of how the scheduler works if you're using it in production.

This doesn't make a whole lot of sense to me. As long as you have GOMAXPROCS set > 1 this shouldn't happen, right? Go doesn't make a secret about only supporting cooperative multitasking above the OS thread level. Are you saying that one spinning thread was locking up all the others?

EDIT: Okay, saw your reply below. Just to reiterate, Erlang has to make a lot of throughput compromises to support pre-emptive multitasking. Just being compiled pretty much takes Go out of that conversation entirely. I'm happy with the tradeoffs for the kinds of things I need to do. And you can always set GOMAXPROCS to a _multiple_ of the number of cores on your machine to get OS-managed pre-empting.

Are you calling a system sleep, or a sleep implemented as part of Go's runtime? If it's a system sleep, and you're on Linux, sched_yield (http://man7.org/linux/man-pages/man2/sched_yield.2.html) may be more appropriate.

But, I acknowledge that is only a slightly-better kludge. It's still not ideal.

edit: georgemcbay's runtime.Gosched() appears to be the "right" kludge, but I'm leaving this up just in case others weren't aware of the system call.

What is your GO_MAX_PROCS value? Sounds like your're running single core?
Yes... the primary case where it was biting us was with our testing suite. It was a set of functional tests that had multiple goroutines going.

It was something that had always passed and never been an issue on 1.0.3, but started failing 85-90% of the time on Go 1.1 with no code changes. :(