Hacker News new | ask | show | jobs
by gpderetta 2935 days ago
>You'll notice that I get the return address into RAX as soon as possible. Believe it or not, this makes a real difference in performance. It allows the CPU to start fetching instructions after the JMP/RET even sooner than if you have it at the bottom of the function as you do.

The "execution" of the jump happens well before the pop rax is executed.

The real difference is using a jump insted of a ret. The latter will be always mispredicted as the CPU return address predictor (the stack engine) will always get it wrong, while the indirect predictor used by jmp has a chance to get it right.

1 comments

Do you both mind elaborating a bit on this? I have to admit CPU instruction scheduling is not really my area.. I'd love to hear more.
It's been about ten years since I made that code change, so I'm a bit fuzzy on the details, but in my mind, I really don't see how "pop rax; jmp rax" is going to be faster than "ret". I really feel like it was important to get the return address into rax as soon as possible.

And that, plus the ability to save/restore the xmm registers for the Win64 ABI (there is no push/pop xmm#), is why I used mov/movaps[] instead. I do recall it also testing faster on Athlons as well, but that was so long ago as to be irrelevant now, most likely.

Nostalgic! :3

In any case, I'd recommend benchmarking both under an application that tries to do as many cothread switches as possible per second. Theory's one thing, seeing the results in practice is always another.

> I really don't see how "pop rax; jmp rax" is going to be faster than "ret".

As I mentioned earlier this is due to jump predictions. jmp rax (or any register really) uses the indirect jump prediction, while ret uses a special dedicated predictor that uses a stack (the Return Stack Buffer or RSB), populated by call instructions, to predict the return address. In the coroutine case, the ret does not jump to the address of the last call so it will be mispredicted many time. The general indirect predictor, while not guaranteed to get it right, has at least a chance to predict the target especially when you have a small set of fibers calling each other in a deterministic sequence. In the general case with dozens of fibers calling each others from random locations, there is no chance of successful prediction in any case.

In principle a CPU could have a meta predictor that would chose between the general indirect predictor and the RSB, but this does not appear to be the case for current CPUs (the only switch to indirect when the stack buffer underflows).

Fibers are generally problematic for the return call predictor, even if the pop/jmp sequence is used, the call into the coroutine switch function will permanently damage the RBS, so any subsequent ret will be eventually mispredicted. For example:

    Fiber 1:
        call foo:
            call coro_switch
               jump fiber 2
    Fiber 2:
        call coro_switch
            jump fiber 1
    Fiber 1
            ret // return from foo call, always mispredicted
In my coroutine library I have been experimenting with writing coro_switch in inline assembler, so there is no call to it (this only works if your target language support inline assembler). So if a fiber switches to another fiber (from any function nesting depth) which then switches back from the same nesting level, the predictor is not damaged. I think this is worth doing to optimize for coroutines used as generators.

BTW, I measured these effects multiple times.

Regarding populating the register well ahead of the jump, it is an interesting idea, but I do not think it matters much in practice. First of all, jumps and call are 'executed' at the fetch stage in the pipeline, this is about 10 clock cycles earlier than the stage that would execute the pop; because of OoO execution this might even be much earlier. Also performing a jump never has a dependency on anything as it is always executed under speculation, so 'pop rax' being completed or not has no impact. The register is only used much later to confirm or rollback the speculation.

I gave this a try:

https://hastebin.com/raw/ipodigizey

There's no difference between an inline assembler call function and a fully unrolled, inlined thread swap that avoids any calls at all. There's only an infinitesimal difference between these two versions and the libco method of using a byte array and a function thunk to invoke it in a compiler-independent way.

Further testing in bsnes (where tens of millions of thread switches occur per second) reveals absolutely no observable difference in performance.

It's a fun idea to toy around with this, but in the end any of these methods are going to be almost identical. You might as well pick whichever you like and go with that.

One (but not both) of the coroutines need to do a return after invoking coro swap to see the advantage of avoiding the call, otherwise the additional call instruction is pretty much 0 cost.

Inline assembler does have another advantage though: you can use extended asm clobbers instead of saving registers manually. At the very least the compiler can do a better job of scheduling the instruction around (which is admittedly marginal on an OoO CPU), but in some cases (like this specific artificial test) can completely omit them. You should be able to see a coroutine jump every other cycle this way (i.e. the theoretical limit as most CPUs can only perform a taken jump that often).

At the end of the day depends on what you are aiming for: if a task is doing some reasonable amount of work between coroutine jumps (i.e. 'just' light weight threads), the canonical implementation is perfectly fine. My end goal is to make coroutines usable as generators, so probably even inline assembler is not enough there as ideally you would like the compiler to optimize, inline and vectorize across coroutine calls.

Thanks for the detailed reply. It made me feel excited, and I want to benchmark the differences.

Where is your implementation?

User space stack switching libraries is kind of my hobby, so I have actually multiple ones:

// A more canonical implementation: https://github.com/gpderetta/libtask/blob/master/continuatio...

// Aggressively optimized: https://github.com/gpderetta/delimited/blob/master/delimited...

If you take a look, you'll see that I specifically designed the API to avoid the need for any global/thread local state.

So, the argument to `coroutine_transfer` is passed to the coroutine and the calling coroutine is returned from `coroutine_transfer`.

So, using %rax is not possible, because it's the return value. That being said, another register would be fine. So you think there is a performance improvement from putting the return address specifically into %rax as soon as possible?