Hacker News
by byuu 2937 days ago
I gave this a try:

https://hastebin.com/raw/ipodigizey

There's no difference between a swap function written in inline assembler and called normally, and a fully unrolled, inlined thread swap that avoids any calls at all. There's only an infinitesimal difference between these two versions and the libco method, which uses a byte array and a function thunk to invoke it in a compiler-independent way.

Further testing in bsnes (where tens of millions of thread switches occur per second) reveals absolutely no observable difference in performance.

It's a fun idea to toy around with, but in the end all of these methods are going to perform almost identically. You might as well pick whichever you like and go with that.

1 comment

One (but not both) of the coroutines needs to execute a return after invoking the coroutine swap for the advantage of avoiding the call to show up; otherwise the additional call instruction costs pretty much nothing.

Inline assembler does have another advantage though: you can use extended asm clobbers instead of saving registers manually. At the very least the compiler can do a better job of scheduling the register saves and restores (which is admittedly marginal on an OoO CPU), and in some cases (like this specific artificial test) it can omit them completely. You should be able to see a coroutine jump every other cycle this way (i.e. the theoretical limit, as most CPUs can only perform a taken jump that often).
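To make the clobber idea concrete, here is a rough sketch of a swap that saves only the stack pointer, frame pointer and resume address by hand and declares everything else clobbered. It assumes x86-64, the System V calling convention, and GCC or Clang extended asm with AT&T syntax. The names coro, coro_swap, worker and ping_pong are made up for illustration; this is not libco's API and not the code from the hastebin link.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#if !defined(__x86_64__) || !defined(__GNUC__)
#error "This sketch assumes x86-64 and GCC/Clang extended asm."
#endif

/* Hypothetical context layout; the asm below hard-codes these offsets. */
typedef struct coro {
    void *sp;  /* offset 0:  saved stack pointer */
    void *ip;  /* offset 8:  address to resume at */
    void *bp;  /* offset 16: saved rbp, swapped by hand because rbp cannot
                  be listed as a clobber while it is the frame pointer */
} coro;

/* Save the current context in *from and jump into *to. All other
   general-purpose and SSE registers are declared clobbered, so the
   compiler spills only the values that are live across the switch. */
static inline __attribute__((always_inline))
void coro_swap(coro *from, coro *to)
{
    __asm__ volatile(
        "leaq 1f(%%rip), %%rax\n\t"   /* resume address = label 1 below */
        "movq %%rax, 8(%[from])\n\t"
        "movq %%rsp, 0(%[from])\n\t"
        "movq %%rbp, 16(%[from])\n\t"
        "movq 0(%[to]), %%rsp\n\t"
        "movq 16(%[to]), %%rbp\n\t"
        "jmpq *8(%[to])\n"
        "1:\n\t"
        : [from] "+S"(from), [to] "+D"(to)   /* pinned to rsi and rdi */
        :
        : "rax", "rbx", "rcx", "rdx",
          "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
          "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
          "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14",
          "xmm15", "memory", "cc");
}

static long counter;  /* bumped once each time the worker is resumed */

/* Entry point of the second coroutine. It is reached by the jmp inside
   coro_swap, so rdi and rsi still hold `to` and `from`, which line up
   with the first two integer arguments. It must never return, because
   there is no real return address on its stack. */
static void worker(coro *self, coro *caller)
{
    for (;;) {
        counter++;
        coro_swap(self, caller);
    }
}

/* Switch into the worker n times and report how often it ran. */
long ping_pong(long n)
{
    enum { STACK_SIZE = 64 * 1024 };
    char *stack = malloc(STACK_SIZE);
    assert(stack != NULL);

    /* Round the top down to 16 bytes, then leave one slot so that rsp
       looks the way it does right after a call instruction. */
    uintptr_t top = ((uintptr_t)stack + STACK_SIZE) & ~(uintptr_t)15;
    coro main_ctx = {0};
    coro work_ctx;
    work_ctx.sp = (void *)(top - 8);
    work_ctx.ip = (void *)(uintptr_t)worker;
    work_ctx.bp = NULL;

    counter = 0;
    for (long i = 0; i < n; i++)
        coro_swap(&main_ctx, &work_ctx);

    free(stack);  /* the worker is suspended and never resumed again */
    return counter;
}
```

Calling ping_pong(n) switches into the worker and back n times. Because neither loop keeps much live state in registers, the compiler has very little to spill around the asm, which is the effect described above. Two caveats: as far as I know GCC's documentation does not officially allow jumping from one asm statement into another, and a fuller version would also have to think about wider vector registers and the floating-point control state, so treat this as an illustration rather than production code.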

At the end of the day it depends on what you are aiming for: if a task is doing some reasonable amount of work between coroutine jumps (i.e. 'just' lightweight threads), the canonical implementation is perfectly fine. My end goal is to make coroutines usable as generators, so even inline assembler is probably not enough there, as ideally you would like the compiler to optimize, inline and vectorize across coroutine calls.