Hacker News new | ask | show | jobs
by mananaysiempre 241 days ago
> Context switching is virtually free, comparable to a function call.

If you’re counting that low, then you need to count carefully.

A coroutine switch, however well implemented, inevitably breaks the branch predictor’s idea of your return stack, but the effect of mispredicted returns will be smeared over the target coroutine’s execution rather than concentrated at the point of the switch. (Similar issues exist with e.g. measuring the effect of blowing the cache on a CPU migration.) I’m actually not sure if Zig’s async design even uses hardware call/return pairs when a (monomorphized-as-)async function calls another one, or if every return just gets translated to an indirect jump. (This option affords what I think is a cleaner design for coroutines with compact frames, but it is much less friendly to the CPU.)

So a foolproof benchmark would require one to compare the total execution time of a (compute-bound) program that constantly switches between (say) two tasks to that of an equivalent program that not only does not switch but (given what little I know about Zig’s “colorless” async) does not run under an async executor(?) at all. Those tasks would also need to yield on a non-trivial call stack each time. Seems quite tricky all in all.

4 comments

If you constantly switch between two tasks from the bottom of their call stack (as for stackless coroutines) and your stack switching code is inlined, then you can mostly avoid the mispaired call/ret penalty.

Also, if you control the compiler, an option is to compile all call/rets in and out of "io" code in terms of explicit jumps. A ret implemented as pop+indirect jump will be less less predictable than a paired ret, but has more chances to be predicted than an unpaired one.

My hope is that, if stackful coroutines become more mainstreams, CPU microarchitectures will start using a meta-predictor to chose between the return stack predictor and the indirect predictor.

> I’m actually not sure if Zig’s async design even uses hardware call/return pairs

Zig no longer has async in the language (and hasn't for quite some time). The OP implemented task switching in user-space.

Even so. You're talking about storing and loading at least ~16 8-byte registers, including the instruction pointer which is essentially a jump. Even to L1 that takes some time; more than a simple function call (jump + pushed return address).
Only stack and instruction pointer are explicitly restored. The rest is handled by the compiler, instead of depending on the C calling convention, it can avoid having things in registers during yield.

See this for more details on how stackful coroutines can be made much faster:

https://photonlibos.github.io/blog/stackful-coroutine-made-f...

> The rest is handled by the compiler, instead of depending on the C calling convention, it can avoid having things in registers during yield.

Yep, the frame pointer as well if you're using it. This is exactly how its implemented in user-space in Zig's WIP std.Io branch green-threading implementation: https://github.com/ziglang/zig/blob/ce704963037fed60a30fd9d4...

On ARM64, only fp, sp and pc are explicitly restored; and on x86_64 only rbp, rsp, and rip. For everything else, the compiler is just informed that the registers will be clobbered by the call, so it can optimize allocation to avoid having to save/restore them from the stack when it can.

Is this just buttering the cost of switches by crippling the optimization options compiler have?
If this was done the classical C way, you would always have to stack-save a number of registers, even if they are not really needed. The only difference here is that the compiler will do the save for you, in whatever way fits the context best. Sometimes it will stack-save, sometimes it will decide to use a different option. It's always strictly better than explicitly saving/restoring N registers unaware of the context. Keep in mind, that in Zig, the compiler always knows the entire code base. It does not work on object/function boundaries. That leads to better optimizations.
I wonder how you see it. Stackful coroutines switch context on syscall in the top stack frame, the deeper frames are regular optimized code, but syscall/sysret is already big context switch. And read/epoll loop has exactly same structure, the point of async programming isn't optimization of computation, but optimization of memory consumption. Performance is determined by features and design (and Electron).
What do you mean by "buttering the cost of switches", can you elaborate? (I am trying to learn about this topic)
Which, with store forwarding, can be shockingly cheap. You may not actually be hitting L1, and if you are, you're probably not hitting it synchronously.

https://easyperf.net/blog/2018/03/09/Store-forwarding

and, section 15.10 of https://www.agner.org/optimize/microarchitecture.pdf

Are you talking about context switching every handful of cycles? This is going to be extremely inefficient even with store forwarding.
Sure, and so is calling a function every handful of cycles. That's a big part of why compilers inline.

Either you're context switching often enough that store forwarding helps, or you're not spending a lot of time context switching. Either way, I would expect that you aren't waiting on L1: you put the write into a queue and move on.

You are right that the statement was overblown, however when I was testing with "trivial" load between yields (synchronized ping-pong between coroutines), I was getting numbers that I had trouble believing, when comparing them to other solutions.
In my test of a similar setup in C++ (IIRC about 10 years ago!), I was able to do a context switch every other cycle. The bottleneck was literally the cycles per taken jump of the microarchitecture I was testing again. As in your case it was a trivial test with two coroutines doing nothing except context switching, so the compiler had no need to save any registers at all and I carefully defined the ABI to be able to keep stack and instruction pointers in registers even across switches.
Semi-unrelated, but async is coming soon to Zig. I'm sorta holding off getting deep into Zig until it lands. https://kristoff.it/blog/zig-new-async-io/
the point of all this io stuff is that you'll be able to start playing with zig before async comes and when async comes it will be either drop in if you choose an async io for main() or it will be a line or two of code if you pick an event loop manually.