We are using libco[1] in our project[2]. It provides a good abstraction over architectures and offers different backends using ucontext, setjmp, and native register management for x86 and x86_64.
The API is a bit more heavyweight than what I submitted to Ruby. That's because `libco` tries to avoid an assembler and instead dynamically loads binary data into an executable section. In my opinion it's an unnecessary performance cost. The benefit is to avoid an assembler at compile time.
I think it's better to use natively compiled code by the normal toolchain. I think it's a simpler design.
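To make the "binary data in an executable section" idea concrete, here is a minimal sketch, not libco's actual code. It assumes a POSIX system on x86-64 with the System V calling convention, and it uses `mmap`/`mprotect` where libco may use other mechanisms. The helper name `call_identity` is hypothetical. The four bytes encode `movq %rdi, %rax; ret`, the same two instructions quoted later in this thread.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

typedef long (*identity_fn)(long);

/* Hypothetical helper: returns `value`, routed through a run-time thunk
 * when possible. *used_thunk is set to 1 if the thunk actually ran. */
long call_identity(long value, int *used_thunk) {
    *used_thunk = 0;
#if defined(__x86_64__)
    /* movq %rdi, %rax ; ret  (System V AMD64 calling convention) */
    static const unsigned char code[] = { 0x48, 0x89, 0xf8, 0xc3 };
    const size_t len = 4096;

    /* Get a writable page, copy the machine code in, then make it executable. */
    void *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return value;

    memcpy(page, code, sizeof code);
    if (mprotect(page, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(page, len);
        return value;
    }

    /* Convert the data pointer to a function pointer without a direct cast. */
    identity_fn fn;
    assert(sizeof fn == sizeof page);
    memcpy(&fn, &page, sizeof fn);

    value = fn(value);           /* an indirect call through the thunk */
    *used_thunk = 1;
    munmap(page, len);
#endif
    return value;
}
```

The extra indirect call is the cost being discussed; the gain is that no assembler is needed at build time. On other architectures the sketch simply returns its argument.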
It's a very tough tradeoff. You're not the first to take aim at it, and that's fair.
Visual C++, GCC/Clang, and Intel C++ have quite different syntax for declaring inline assembler. I did not wish to burden libco users with having to use a specific compiler. Further, getting GCC/Clang to produce "naked" functions (without stack BP/SP adjustments) can prove ... challenging, and that is critical for libco's co_switch routine. I recall having trouble with a less popular OS. Using an external assembler like yasm or GNU as adds further dependencies on specific tools.
It is a very minor performance penalty to have the co_swap thunk, but it is non-zero, so I respect your decision. But do note at least that libco supports many architectures (without going pointlessly obscure.) It'd be a shame to miss out on cooperating in supporting architectures by duplicating our work over something like this.
With or without this, coroutines will always be slaughtered compared to a simple stackless state machine function, but they really shine and (in my opinion) are worth their overhead when you need to switch tasks in the middle of nested function calls.
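For comparison, a stackless state machine might look like the sketch below. All names here (`struct counter`, `counter_next`, `counter_sum`) are made up for illustration. Every "yield" is a plain `return`, which is why it is so cheap, and also why it cannot suspend from inside a nested helper call.

```c
#include <assert.h>

/* Everything that must survive between calls lives in this struct,
 * because the function gives up its stack frame at every "yield". */
struct counter {
    int state;   /* 0 = not started, 1 = running, 2 = finished */
    int i;
    int limit;
};

/* Stores the next value in *out and returns 1, or returns 0 when done. */
int counter_next(struct counter *c, int *out) {
    switch (c->state) {
    case 0:
        c->i = 0;
        c->state = 1;
        /* fall through */
    case 1:
        if (c->i < c->limit) {
            *out = c->i++;
            return 1;            /* "yield": an ordinary return, no stack switch */
        }
        c->state = 2;
        /* fall through */
    default:
        return 0;
    }
}

/* Drives the state machine to completion and sums what it produced. */
int counter_sum(int limit) {
    struct counter c = { 0, 0, limit };
    int value, sum = 0;
    while (counter_next(&c, &value))
        sum += value;
    return sum;
}
```

If `counter_next` had to suspend from three calls deep, each of those callees would need its own state struct and `switch`. A stackful coroutine avoids that rewrite at the price of a real context switch.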
...
```
# Put the first argument into the return value
movq %rdi, %rax
# We pop the return address and jump to it
ret
```
You'll notice that I get the return address into RAX as soon as possible. Believe it or not, this makes a real difference in performance. It allows the CPU to start fetching instructions after the JMP/RET even sooner than if you have it at the bottom of the function as you do.
The choice between push/pop and mov [rsp] for handling the non-volatile registers isn't really important. I found the latter slightly more performant on Athlon 64 CPUs, and a wash on Intel CPUs.
Preserving signals isn't so critical, but preserving the SSE registers is probably quite important, and it is only necessary on Windows. Again, your decision, but I would preserve them in your case. Skipping it can lead to really nasty surprises.
>You'll notice that I get the return address into RAX as soon as possible. Believe it or not, this makes a real difference in performance. It allows the CPU to start fetching instructions after the JMP/RET even sooner than if you have it at the bottom of the function as you do.
The "execution" of the jump happens well before the pop rax is executed.
The real difference is using a jump instead of a ret. The latter will always be mispredicted, as the CPU return address predictor (the stack engine) will always get it wrong, while the indirect predictor used by jmp has a chance to get it right.
It's been about ten years since I made that code change, so I'm a bit fuzzy on the details, but in my mind, I really don't see how "pop rax; jmp rax" is going to be faster than "ret". I really feel like it was important to get the return address into rax as soon as possible.
And that, plus the ability to save/restore the xmm registers for the Win64 ABI (there is no push/pop xmm#), is why I used mov/movaps[] instead. I do recall it also testing faster on Athlons as well, but that was so long ago as to be irrelevant now, most likely.
Nostalgic! :3
In any case, I'd recommend benchmarking both under an application that tries to do as many cothread switches as possible per second. Theory's one thing, seeing the results in practice is always another.
> I really don't see how "pop rax; jmp rax" is going to be faster than "ret".
As I mentioned earlier, this is due to branch prediction. jmp rax (or any register, really) uses the indirect jump predictor, while ret uses a dedicated predictor that uses a stack (the Return Stack Buffer, or RSB), populated by call instructions, to predict the return address. In the coroutine case, the ret does not jump to the address of the last call, so it will be mispredicted many times. The general indirect predictor, while not guaranteed to get it right, has at least a chance of predicting the target, especially when you have a small set of fibers calling each other in a deterministic sequence. In the general case, with dozens of fibers calling each other from random locations, there is no chance of successful prediction either way.
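A toy model of that argument, under loud assumptions: real predictors are far more elaborate, and every name here is hypothetical. Each fiber's resume address is represented by its index. The "RSB" is a small stack filled by calls. The "indirect predictor" is a table keyed by the previous target of the same jmp, that is, one step of history. Two fibers ping-pong in a fixed order.

```c
#include <assert.h>

#define RSB_DEPTH 16
#define FIBERS    2

/* Toy return stack buffer: calls push, rets pop. Overflow is simply ignored. */
struct rsb {
    int addr[RSB_DEPTH];
    int count;
};

static void rsb_push(struct rsb *r, int a) {
    if (r->count < RSB_DEPTH) r->addr[r->count++] = a;
}

static int rsb_pop(struct rsb *r) {
    return r->count > 0 ? r->addr[--r->count] : -1;
}

/* Switch routine that ends in `ret`: predicted from the RSB. */
int mispredicts_with_ret(int switches) {
    struct rsb r = { {0}, 0 };
    int current = 0, misses = 0;
    for (int i = 0; i < switches; i++) {
        int next = (current + 1) % FIBERS;
        rsb_push(&r, current);          /* `call switch` records the caller's resume address */
        int predicted = rsb_pop(&r);    /* `ret` predicts a return to that caller */
        if (predicted != next) misses++;/* but control actually resumes in the other fiber */
        current = next;
    }
    return misses;
}

/* Switch routine that ends in `jmp reg`: predicted from one step of history. */
int mispredicts_with_jmp(int switches) {
    int table[FIBERS];
    int current = 0, misses = 0;
    for (int i = 0; i < FIBERS; i++) table[i] = -1;
    for (int i = 0; i < switches; i++) {
        int next = (current + 1) % FIBERS;
        if (table[current] != next) misses++;  /* cold entries miss once */
        table[current] = next;                 /* then the pattern is learned */
        current = next;
    }
    return misses;
}
```

In this model, 1000 switches cost 1000 mispredictions with `ret` and 2 with `jmp`. The model says nothing about random switching among many fibers, where neither predictor is expected to do well.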
In principle a CPU could have a meta predictor that would choose between the general indirect predictor and the RSB, but this does not appear to be the case for current CPUs (they only switch to the indirect predictor when the stack buffer underflows).
Fibers are generally problematic for the return address predictor: even if the pop/jmp sequence is used, the call into the coroutine switch function will permanently damage the RSB, so any subsequent ret will eventually be mispredicted. For example:
In my coroutine library I have been experimenting with writing coro_switch in inline assembler, so there is no call to it (this only works if your target language supports inline assembler). So if a fiber switches to another fiber (from any function nesting depth) which then switches back from the same nesting level, the predictor is not damaged. I think this is worth doing to optimize for coroutines used as generators.
BTW, I measured these effects multiple times.
Regarding populating the register well ahead of the jump, it is an interesting idea, but I do not think it matters much in practice. First of all, jumps and calls are 'executed' at the fetch stage in the pipeline; this is about 10 clock cycles earlier than the stage that would execute the pop, and because of OoO execution it might be even earlier. Also, performing a jump never has a dependency on anything, as it is always executed under speculation, so whether 'pop rax' has completed or not has no impact. The register is only used much later to confirm or roll back the speculation.
If you take a look, you'll see that I specifically designed the API to avoid the need for any global/thread local state.
So, the argument to `coroutine_transfer` is passed to the coroutine and the calling coroutine is returned from `coroutine_transfer`.
So, using %rax is not possible, because it's the return value. That being said, another register would be fine. So you think there is a performance improvement from putting the return address specifically into %rax as soon as possible?
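The transfer semantics described above (the caller is handed to the target, and whoever switches back is returned) can be emulated without any global or thread-local state. The sketch below is built on POSIX `ucontext` rather than hand-written assembly. The names `struct coro`, `coro_transfer`, and `transfer_demo` are hypothetical, and the real `coroutine_transfer` signature may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

struct coro {
    ucontext_t ctx;
    struct coro *from;          /* who most recently transferred into us */
    struct coro *first_caller;  /* recorded by the worker, for the demo only */
    char stack[STACK_SIZE];
};

/* Switch from `current` to `target`. When some coroutine later switches
 * back into `current`, return that coroutine. */
struct coro *coro_transfer(struct coro *current, struct coro *target) {
    target->from = current;
    swapcontext(&current->ctx, &target->ctx);
    return current->from;
}

/* makecontext only passes int-sized arguments portably, so the pointer to
 * the coroutine is split into two halves and reassembled here. */
static void worker_entry(unsigned int hi, unsigned int lo) {
    struct coro *self = (struct coro *)(uintptr_t)(((uint64_t)hi << 32) | lo);
    struct coro *caller = self->from;   /* the caller, delivered on first entry */
    self->first_caller = caller;
    coro_transfer(self, caller);        /* hand control back; never resumed */
}

/* Returns 1 if the caller was delivered to the worker and the worker was
 * returned from the transfer in main. Static storage holds the stacks only;
 * nothing is passed through it. */
int transfer_demo(void) {
    static struct coro main_co, worker;
    uint64_t p = (uint64_t)(uintptr_t)&worker;

    if (getcontext(&worker.ctx) != 0) return 0;
    worker.ctx.uc_stack.ss_sp = worker.stack;
    worker.ctx.uc_stack.ss_size = sizeof worker.stack;
    worker.ctx.uc_link = NULL;
    makecontext(&worker.ctx, (void (*)(void))worker_entry, 2,
                (unsigned int)(p >> 32), (unsigned int)(p & 0xffffffffu));

    struct coro *returned = coro_transfer(&main_co, &worker);
    return returned == &worker && worker.first_caller == &main_co;
}
```

Here the "who switched to me" pointer travels through the target's own struct. In an assembly version it can travel in a register, which is what `movq %rdi, %rax` does, and that is why %rax is already spoken for.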
I was inspired by libco, and it is a great library.
I am happy to collaborate.
I tested the aarch64 implementation on a Raspberry Pi. Sometimes it's tricky to test. I was thinking of setting up a test harness for the different architectures.
Another idea I had was to distribute compiled `.o` files to avoid the assembler nightmare, which I agree is a problem.