|
> I really don't see how "pop rax; jmp rax" is going to be faster than "ret".

As I mentioned earlier, this is due to jump prediction. jmp rax (or any register, really) uses the indirect jump predictor, while ret uses a dedicated predictor backed by a stack (the Return Stack Buffer, or RSB), populated by call instructions, to predict the return address. In the coroutine case, the ret does not jump to the address of the last call, so it will be mispredicted many times. The general indirect predictor, while not guaranteed to get it right, at least has a chance of predicting the target, especially when you have a small set of fibers calling each other in a deterministic sequence. In the general case, with dozens of fibers calling each other from random locations, there is no chance of successful prediction either way.

In principle a CPU could have a meta-predictor that chooses between the general indirect predictor and the RSB, but this does not appear to be the case for current CPUs (they only switch to the indirect predictor when the stack buffer underflows).

Fibers are generally problematic for the return predictor: even if the pop/jmp sequence is used, the call into the coroutine switch function will permanently damage the RSB, so any subsequent ret will eventually be mispredicted. For example:

Fiber 1:
    call foo:
        call coro_switch
        jump fiber 2
Fiber 2:
    call coro_switch
    jump fiber 1
Fiber 1:
    ret // return from foo call, always mispredicted
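
To make the sequence above concrete, here is a toy model of the RSB as a plain stack. It is only a sketch: real predictors have a fixed depth and extra machinery, and the return-address labels (A, B, C) and the event encoding are made up for illustration.

```python
# Toy model of a return stack buffer (RSB); real hardware is more complex.
# The trace encoding and the labels A, B, C, F2 are hypothetical.

def run_trace(trace):
    """Replay (op, address) events and report the outcome of each 'ret'."""
    rsb = []       # predicted return addresses, pushed by 'call'
    outcomes = []
    for op, addr in trace:
        if op == "call":
            rsb.append(addr)  # addr is the return address of this call
        elif op == "ret":
            predicted = rsb.pop() if rsb else None
            outcomes.append((predicted, addr, predicted == addr))
        # 'jmp' (the pop/jmp switch) never touches the RSB
    return outcomes

trace = [
    ("call", "A"),   # fiber 1: call foo, return address A
    ("call", "B"),   # fiber 1: call coro_switch, return address B
    ("jmp", "F2"),   # switch to fiber 2 via pop/jmp
    ("call", "C"),   # fiber 2: call coro_switch, return address C
    ("jmp", "B"),    # switch back to fiber 1, resuming at B
    ("ret", "A"),    # fiber 1: ret from foo, real target is A
]

for predicted, actual, hit in run_trace(trace):
    print(f"predicted {predicted}, actual {actual}:", "hit" if hit else "MISPREDICT")
```

In this model the final ret is predicted to go to C (the stale entry left by fiber 2's call) while the real target is A.
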
In my coroutine library I have been experimenting with writing coro_switch in inline assembler, so there is no call to it (this only works if your target language supports inline assembler). So if a fiber switches to another fiber (from any function nesting depth), which then switches back from the same nesting level, the predictor is not damaged. I think this is worth doing to optimize for coroutines used as generators.

BTW, I measured these effects multiple times.

Regarding populating the register well ahead of the jump: it is an interesting idea, but I do not think it matters much in practice. First of all, jumps and calls are 'executed' at the fetch stage of the pipeline, which is about 10 clock cycles earlier than the stage that would execute the pop; because of OoO execution it might be even earlier. Also, performing a jump never has a dependency on anything, as it is always executed under speculation, so whether 'pop rax' has completed or not has no impact. The register is only used much later to confirm or roll back the speculation. |
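
As a rough illustration of the inline-switch point (a toy model, not a measurement), the sketch below compares a called coro_switch with an inlined one for the generator pattern. The RSB is modelled as an unbounded stack, and the trace encoding, the helper names and the address labels are all hypothetical.

```python
# Toy RSB model; the event encoding and all names here are hypothetical.

def count_mispredicted_rets(trace):
    """Count 'ret' events whose RSB prediction differs from the real target."""
    rsb, misses = [], 0
    for op, addr in trace:
        if op == "call":
            rsb.append(addr)  # return address pushed by the call
        elif op == "ret":
            predicted = rsb.pop() if rsb else None
            misses += int(predicted != addr)
        # 'jmp' (the pop/jmp switch) leaves the RSB alone
    return misses

def generator_round_trip(inline_switch, rounds):
    """foo() in fiber 1 pulls `rounds` values from a generator fiber, then returns."""
    trace = [("call", "A")]              # call foo, return address A
    for _ in range(rounds):
        if not inline_switch:
            trace.append(("call", "B"))  # fiber 1 calls coro_switch
        trace.append(("jmp", "gen"))     # pop/jmp into the generator
        if not inline_switch:
            trace.append(("call", "C"))  # generator calls coro_switch
        trace.append(("jmp", "B"))       # pop/jmp back into fiber 1
    trace.append(("ret", "A"))           # foo returns to its caller
    return trace

called = count_mispredicted_rets(generator_round_trip(inline_switch=False, rounds=3))
inlined = count_mispredicted_rets(generator_round_trip(inline_switch=True, rounds=3))
print("called switch:", called, "mispredicted ret(s)")
print("inline switch:", inlined, "mispredicted ret(s)")
```

With the called switch, the unmatched calls leave stale entries on top of foo's return address, so the ret from foo is mispredicted; with the inlined switch nothing is pushed and the ret is predicted correctly. The model only tracks that one ret, and it says nothing about how much this costs on real hardware.
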
https://hastebin.com/raw/ipodigizey
There's no difference between an inline assembler call function and a fully unrolled, inlined thread swap that avoids any calls at all. There's only an infinitesimal difference between these two versions and the libco method of using a byte array and a function thunk to invoke it in a compiler-independent way.
Further testing in bsnes (where tens of millions of thread switches occur per second) reveals absolutely no observable difference in performance.
It's a fun idea to toy around with this, but in the end any of these methods are going to be almost identical. You might as well pick whichever you like and go with that.