
One (but not both) of the coroutines needs to do a return after invoking coro swap for the advantage of avoiding the call to show up; otherwise the additional call instruction costs pretty much nothing.

Inline assembler does have another advantage though: you can use extended asm clobbers instead of saving registers manually. At the very least the compiler can do a better job of scheduling the save and restore instructions around the switch (which is admittedly marginal on an OoO CPU), and in some cases (like this specific artificial test) it can omit them completely. You should be able to see a coroutine jump every other cycle this way (i.e. the theoretical limit, as most CPUs can only perform a taken jump that often).
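To make the clobber idea concrete, here is a rough sketch, not the coro swap from the article. It assumes x86-64 with the System V ABI and GCC or Clang; `coro_swap`, `gen_body`, `gen_next` and the globals are hypothetical names made up for illustration. Only `rbp` and the resume address are pushed by hand; everything else is declared as clobbered, so the compiler spills only what is actually live across the switch. x87 state and MXCSR are not handled here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical globals: saved stack pointers and the value being yielded. */
static void *main_sp, *gen_sp;
static int gen_value, gen_started;
static _Alignas(16) unsigned char gen_stack[64 * 1024];

/* Save the current stack pointer to *save_sp and resume the context whose
 * stack pointer is load_sp. Assumes x86-64 System V and GCC/Clang. */
static inline __attribute__((always_inline))
void coro_swap(void **save_sp, void *load_sp)
{
    __asm__ volatile(
        "leaq -128(%%rsp), %%rsp\n\t"  /* step over the red zone */
        "pushq %%rbp\n\t"              /* rbp is saved by hand */
        "leaq 1f(%%rip), %%rax\n\t"
        "pushq %%rax\n\t"              /* address to resume at */
        "movq %%rsp, (%[save])\n\t"    /* publish our stack pointer */
        "movq %[load], %%rsp\n\t"      /* switch stacks */
        "popq %%rax\n\t"
        "jmpq *%%rax\n\t"              /* plain jump, no call/ret pair */
        "1:\n\t"
        "popq %%rbp\n\t"
        "leaq 128(%%rsp), %%rsp\n\t"
        : [save] "+D"(save_sp), [load] "+S"(load_sp)
        :
        /* Everything else is clobbered: the compiler decides what to spill. */
        : "rax", "rbx", "rcx", "rdx",
          "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
          "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
          "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14",
          "xmm15", "memory", "cc");
}

/* Runs on gen_stack; yields 0, 1, 2, ... and never returns. */
static void gen_body(void)
{
    for (int i = 0;; ++i) {
        gen_value = i;
        coro_swap(&gen_sp, main_sp);
    }
}

/* Resume the generator and return the next value it yields. */
static int gen_next(void)
{
    if (!gen_started) {
        void **top = (void **)(gen_stack + sizeof gen_stack);
        top[-1] = 0;                            /* fake return address */
        top[-2] = (void *)(uintptr_t)gen_body;  /* first resume address */
        gen_sp = &top[-2];
        assert(((uintptr_t)gen_sp & 15) == 0);  /* ABI stack alignment */
        gen_started = 1;
    }
    coro_swap(&main_sp, gen_sp);
    return gen_value;
}
```

Successive calls to `gen_next()` return 0, 1, 2 and so on. Since each side only keeps the loop counter (if anything) live across the switch, the compiler has very little to spill around the asm statement.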

At the end of the day it depends on what you are aiming for: if a task is doing some reasonable amount of work between coroutine jumps (i.e. 'just' lightweight threads), the canonical implementation is perfectly fine. My end goal is to make coroutines usable as generators, so probably even inline assembler is not enough there, as ideally you would like the compiler to optimize, inline and vectorize across coroutine calls.
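For contrast, this is roughly what the generator case looks like when the compiler can see through it. The sketch below is a hand-written, stackless stand-in rather than a real coroutine, and `squares_gen`, `squares_next` and `sum_squares` are hypothetical names. Because the generator state is an ordinary struct and the "resume" is an inlinable function, the consumer loop is fully visible to the optimizer, which is the property a stack switch hidden inside an asm block cannot give you.

```c
#include <stdint.h>

/* Hypothetical generator state: yields 0*0, 1*1, 2*2, ... */
typedef struct {
    uint32_t i;
} squares_gen;

/* "Resume" the generator: return the next square and advance the state. */
static inline uint32_t squares_next(squares_gen *g)
{
    uint32_t v = g->i * g->i;
    g->i++;
    return v;
}

/* Consumer loop: after inlining squares_next this is a plain counted loop,
 * which an optimizing compiler is free to unroll or vectorize. */
static uint64_t sum_squares(uint32_t n)
{
    squares_gen g = {0};
    uint64_t sum = 0;
    for (uint32_t k = 0; k < n; ++k)
        sum += squares_next(&g);
    return sum;
}
```

For example, `sum_squares(4)` adds 0 + 1 + 4 + 9. A stackful coroutine producing the same sequence would pay at least a taken jump per element, however cheap the switch itself is.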