One thing that really goes against my intuition is that user-space threads (lightweight threads, goroutines) are faster than kernel threads. Without knowing too much assembly, I would assume any modern processor would make a context switch a one-instruction affair: interrupt -> small scheduler code picks the thread to run -> LOAD THREAD instruction, and the processor swaps in all the registers and the instruction pointer. You probably can't beat that in user space, especially if you want to preempt threads yourself. You'd have to check after every step, or profile your own process, or something like that. And indeed, Go's scheduler is cooperative.
But then, why can't you get the performance of goroutines with OS threads? Is it just because of legacy issues? Or does it only work with cooperative threading, which requires language support?
One thing I'm missing from that article is how the cooperativeness is implemented. I think in Go (and in Java's Project Loom), you have "normal code", but then deep down in the network and IO functions, you have magic "yield" instructions. So all the layers above can pretend they are running on regular threads, and you avoid the "colored function problem", but you get runtime behavior similar to coroutines. That only works if really every blocking IO call is modified to include yielding behavior. If you call a blocking OS function, I assume something bad will happen.
It hasn't been cooperative for a few versions now: goroutines became asynchronously preemptible in Go 1.14. And before that there were yield points in every function prologue (as well as in all IO primitives), so there were relatively few situations where explicit cooperation was necessary.
> Without knowing too much assembly, I would assume any modern processor would make a context switch a one instruction affair.
Any context switch (to the kernel) is expensive, and takes far more than a single instruction. The kernel also has a ton of stuff to do beyond "picks the thread to run": it has to restore the instruction pointer and stack pointer, and it may also have to restore FPU/SSE/AVX state (AVX-512 alone is over 2 KB of state) and trap state.
Kernel-level context switching costs on the order of 10x what userland context switching does: https://eli.thegreenplace.net/2018/measuring-context-switchi...
> LOAD THREAD
There is no "load thread" instruction.