Hacker News new | ask | show | jobs
by gpderetta 2937 days ago
> I really don't see how "pop rax; jmp rax" is going to be faster than "ret".

As I mentioned earlier this is due to jump predictions. jmp rax (or any register really) uses the indirect jump prediction, while ret uses a special dedicated predictor that uses a stack (the Return Stack Buffer or RSB), populated by call instructions, to predict the return address. In the coroutine case, the ret does not jump to the address of the last call so it will be mispredicted many time. The general indirect predictor, while not guaranteed to get it right, has at least a chance to predict the target especially when you have a small set of fibers calling each other in a deterministic sequence. In the general case with dozens of fibers calling each others from random locations, there is no chance of successful prediction in any case.

In principle a CPU could have a meta predictor that would chose between the general indirect predictor and the RSB, but this does not appear to be the case for current CPUs (the only switch to indirect when the stack buffer underflows).

Fibers are generally problematic for the return call predictor, even if the pop/jmp sequence is used, the call into the coroutine switch function will permanently damage the RBS, so any subsequent ret will be eventually mispredicted. For example:

    Fiber 1:
        call foo:
            call coro_switch
               jump fiber 2
    Fiber 2:
        call coro_switch
            jump fiber 1
    Fiber 1
            ret // return from foo call, always mispredicted
In my coroutine library I have been experimenting with writing coro_switch in inline assembler, so there is no call to it (this only works if your target language support inline assembler). So if a fiber switches to another fiber (from any function nesting depth) which then switches back from the same nesting level, the predictor is not damaged. I think this is worth doing to optimize for coroutines used as generators.

BTW, I measured these effects multiple times.

Regarding populating the register well ahead of the jump, it is an interesting idea, but I do not think it matters much in practice. First of all, jumps and call are 'executed' at the fetch stage in the pipeline, this is about 10 clock cycles earlier than the stage that would execute the pop; because of OoO execution this might even be much earlier. Also performing a jump never has a dependency on anything as it is always executed under speculation, so 'pop rax' being completed or not has no impact. The register is only used much later to confirm or rollback the speculation.

2 comments

I gave this a try:

https://hastebin.com/raw/ipodigizey

There's no difference between an inline assembler call function and a fully unrolled, inlined thread swap that avoids any calls at all. There's only an infinitesimal difference between these two versions and the libco method of using a byte array and a function thunk to invoke it in a compiler-independent way.

Further testing in bsnes (where tens of millions of thread switches occur per second) reveals absolutely no observable difference in performance.

It's a fun idea to toy around with this, but in the end any of these methods are going to be almost identical. You might as well pick whichever you like and go with that.

One (but not both) of the coroutines need to do a return after invoking coro swap to see the advantage of avoiding the call, otherwise the additional call instruction is pretty much 0 cost.

Inline assembler does have another advantage though: you can use extended asm clobbers instead of saving registers manually. At the very least the compiler can do a better job of scheduling the instruction around (which is admittedly marginal on an OoO CPU), but in some cases (like this specific artificial test) can completely omit them. You should be able to see a coroutine jump every other cycle this way (i.e. the theoretical limit as most CPUs can only perform a taken jump that often).

At the end of the day depends on what you are aiming for: if a task is doing some reasonable amount of work between coroutine jumps (i.e. 'just' light weight threads), the canonical implementation is perfectly fine. My end goal is to make coroutines usable as generators, so probably even inline assembler is not enough there as ideally you would like the compiler to optimize, inline and vectorize across coroutine calls.

Thanks for the detailed reply. It made me feel excited, and I want to benchmark the differences.

Where is your implementation?

User space stack switching libraries is kind of my hobby, so I have actually multiple ones:

// A more canonical implementation: https://github.com/gpderetta/libtask/blob/master/continuatio...

// Aggressively optimized: https://github.com/gpderetta/delimited/blob/master/delimited...