by byuu
2942 days ago
It's been about ten years since I made that code change, so I'm a bit fuzzy on the details, but in my mind, I really don't see how "pop rax; jmp rax" is going to be faster than "ret". I really feel like it was important to get the return address into rax as soon as possible. And that, plus the ability to save/restore the xmm registers for the Win64 ABI (there is no push/pop xmm#), is why I used mov/movaps[] instead. I do recall it also testing faster on Athlons as well, but that was so long ago as to be irrelevant now, most likely. Nostalgic! :3 In any case, I'd recommend benchmarking both under an application that tries to do as many cothread switches as possible per second. Theory's one thing, seeing the results in practice is always another.
As I mentioned earlier, this is due to branch prediction. jmp rax (or any register, really) uses the indirect jump predictor, while ret uses a dedicated predictor built around a stack (the Return Stack Buffer, or RSB), populated by call instructions, to predict the return address. In the coroutine case, the ret does not jump to the address following the last call, so it will be mispredicted many times. The general indirect predictor, while not guaranteed to get it right, at least has a chance of predicting the target, especially when you have a small set of fibers calling each other in a deterministic sequence. In the general case, with dozens of fibers calling each other from random locations, there is no chance of successful prediction either way.
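To make the "deterministic sequence versus random locations" point concrete, here is a toy Python model of a history-indexed indirect predictor. It is only a sketch: the function name `indirect_hit_rate`, the two-target history, and the address values are all assumptions for illustration, and real predictors are far more elaborate.

```python
import random

def indirect_hit_rate(targets, history_len=2):
    """Fraction of jumps a toy history-indexed predictor gets right.

    `targets` is the sequence of addresses taken by a single `jmp rax` site.
    The table maps the last `history_len` targets to whatever target
    followed that history the previous time it was seen (hypothetical model).
    """
    table = {}
    history = ()
    hits = 0
    for actual in targets:
        if table.get(history) == actual:
            hits += 1
        table[history] = actual                      # train on the real outcome
        history = (history + (actual,))[-history_len:]
    return hits / len(targets)

# Three fibers resuming each other in a fixed round-robin order.
deterministic = [0xA0, 0xB0, 0xC0] * 1000

# Dozens of resume points hit in random order (seeded so runs are repeatable).
rng = random.Random(0)
scattered = [rng.randrange(48) for _ in range(3000)]

print(f"round-robin: {indirect_hit_rate(deterministic):.3f}")
print(f"scattered:   {indirect_hit_rate(scattered):.3f}")
```

In this model the round-robin case is learned after a handful of warm-up misses, while the scattered case stays close to guessing.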
In principle, a CPU could have a meta-predictor that would choose between the general indirect predictor and the RSB, but this does not appear to be the case for current CPUs (they only switch to the indirect predictor when the stack buffer underflows).
Fibers are generally problematic for the return predictor. Even if the pop/jmp sequence is used, the call into the coroutine switch function will permanently damage the RSB, so any subsequent ret will eventually be mispredicted. For example:
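One way to picture the damage is a toy software model of the RSB. The Python sketch below assumes an unbounded RSB (real ones are small and fixed-depth), uses made-up labels such as "A_resume" in place of addresses, and does not model the indirect predictor that handles the jmp. The name `round_trip` is hypothetical. It follows fiber A, two calls deep, switching to a generator B and back.

```python
def round_trip(switch_style):
    """Model fiber A (two calls deep) switching to generator B and back.

    switch_style: "ret"     -> switch routine ends in `ret`
                  "pop_jmp" -> switch routine ends in `pop rax; jmp rax`
    Returns (mispredicted rets inside the switch, mispredicted ordinary rets).
    """
    rsb = []                       # toy return stack buffer (unbounded here)
    misses = {"switch": 0, "ordinary": 0}

    def call(stack, return_addr):
        stack.append(return_addr)  # real return address on the fiber's stack
        rsb.append(return_addr)    # predictor records the same address

    def ret(stack, kind):
        actual = stack.pop()
        predicted = rsb.pop() if rsb else None
        if predicted != actual:
            misses[kind] += 1

    def switch(src, dst, call_site):
        call(src, call_site)       # `call coro_switch` from the source fiber
        # ...the stack pointer now points at dst's stack...
        if switch_style == "ret":
            ret(dst, "switch")     # consults the RSB, which holds src's call site
        else:
            dst.pop()              # `pop rax; jmp rax` never touches the RSB

    a_stack = []
    b_stack = ["B_resume"]         # B was suspended earlier inside coro_switch
    call(a_stack, "A_after_f")     # A calls f
    call(a_stack, "A_after_g")     # f calls g
    switch(a_stack, b_stack, "A_resume")   # g yields to B
    switch(b_stack, a_stack, "B_resume")   # B yields back at the same depth
    ret(a_stack, "ordinary")       # g returns into f
    ret(a_stack, "ordinary")       # f returns into A's caller
    return misses["switch"], misses["ordinary"]

for style in ("ret", "pop_jmp"):
    print(style, round_trip(style))
```

In this model the ret variant mispredicts both switches but leaves the RSB balanced, giving (2, 0). The pop/jmp variant leaves two stale entries behind, so both of A's ordinary returns miss, giving (0, 2).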
In my coroutine library I have been experimenting with writing coro_switch in inline assembler, so there is no call to it (this only works if your target language supports inline assembler). So if a fiber switches to another fiber (from any function nesting depth), which then switches back from the same nesting level, the predictor is not damaged. I think this is worth doing to optimize for coroutines used as generators. BTW, I measured these effects multiple times.
Regarding populating the register well ahead of the jump: it is an interesting idea, but I do not think it matters much in practice. First of all, jumps and calls are 'executed' at the fetch stage of the pipeline, about 10 clock cycles earlier than the stage that would execute the pop; because of OoO execution this might be even earlier. Also, performing a jump never has a dependency on anything, as it is always executed under speculation, so whether 'pop rax' has completed or not has no impact. The register is only used much later to confirm or roll back the speculation.