Your comment was really interesting to read. Do you think you can unpack it a little more for someone like me who sits on top of a much higher level of abstraction?
I'm not sure how much deeper I can go ... in essence an x86 instruction like "call *N(r)" is really 4 micro-ops: "ld tmp, N(r); st pc, (sp); sub sp, sp, 4; jmp (tmp)". In RISC-V it's more like "ld tmp, N(r); jalr (tmp)". We can ignore the "st pc, (sp); sub sp, sp, 4" for the moment because they don't delay executing the first piece of code in the method you're calling, and leaving them out lets us compare apples with apples.
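For anyone coming from a higher level, here is roughly what that looks like one step up, as a minimal C sketch of a hand-rolled vtable. The names (shape, shape_vtable, call_sides) are made up for illustration, and the comments mapping lines to "ld" and "jalr" are only approximate, since the exact instructions depend on the compiler and target.

```c
struct shape;

/* Hypothetical vtable: a table of function pointers shared by all
   objects of one "class". */
struct shape_vtable {
    int (*area)(const struct shape *self);
    int (*sides)(const struct shape *self);
};

struct shape {
    const struct shape_vtable *vt;  /* pointer to the vtable */
    int w, h;
};

static int rect_area(const struct shape *s) { return s->w * s->h; }
static int rect_sides(const struct shape *s) { (void)s; return 4; }

static const struct shape_vtable rect_vt = { rect_area, rect_sides };

int call_area(const struct shape *s)
{
    const struct shape_vtable *r = s->vt;          /* r = vtable pointer    */
    int (*tmp)(const struct shape *) = r->area;    /* roughly: ld tmp, 0(r) */
    return tmp(s);                                 /* roughly: jalr (tmp)   */
}

int call_sides(const struct shape *s)
{
    const struct shape_vtable *r = s->vt;
    int (*tmp)(const struct shape *) = r->sides;   /* roughly: ld tmp, N(r) */
    return tmp(s);                                 /* indirect call via tmp */
}
```

The point is that the call cannot know where it is going until the load of tmp has completed, which is the dependency the rest of this comment is about.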
So you need a memory fetch followed by an indirect jump. The result of the memory fetch arrives at the very end of the CPU's pipeline, so a simple CPU won't fetch the next instruction until it knows the value of tmp.
However, any modern high-end CPU is going to guess ('predict') the destination of the jump and start executing code from that predicted destination. If it guesses wrongly, those instructions and their results have to be discarded (a "pipe-flush"). There tend to be two sorts of predictors: one for conditional branches and one for indirect branches. The conditional ones tend to have a better hit rate. The indirect ones (this case) always fail on the first attempt and tend to be broken by things like a random mix of function pointers in vtables (to be fair, the same can probably be said for using conditional branches in a similar situation).
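To make the "fails on the first attempt" and "broken by a random mix" points concrete, here is a toy model in C. It assumes the simplest possible indirect predictor, one that just remembers the last target seen at a call site. Real predictors are considerably more sophisticated (they typically use branch history), so treat this as an illustration of the failure modes only, with all names invented for the example.

```c
#include <stddef.h>

/* Toy model: one entry that remembers the last target seen
   at a single indirect call site. */
struct last_target_predictor {
    int valid;             /* 0 until the call site has been seen once */
    unsigned long target;  /* last observed destination address */
};

/* Returns 1 if the prediction was right, 0 on a mispredict,
   then trains the entry with the actual target. */
static int predict_and_train(struct last_target_predictor *p,
                             unsigned long actual)
{
    int hit = p->valid && p->target == actual;
    p->valid = 1;
    p->target = actual;
    return hit;
}

/* Counts mispredicts (pipe flushes, in this model) over a
   sequence of actual call targets. */
size_t count_mispredicts(const unsigned long *targets, size_t n)
{
    struct last_target_predictor p = { 0, 0 };
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (!predict_and_train(&p, targets[i]))
            misses++;
    }
    return misses;
}
```

In this model a call site that always goes to the same target misses exactly once, while one that alternates between two targets misses every time.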
In RISC-V the compiler can still schedule the "ld tmp, N(r)" earlier in the instruction stream. That's not possible on x86 when the load is folded into the call instruction itself. However, if you use a conditional branch (an if statement rather than an indirect call), you can move the instructions that compute the condition earlier in the instruction stream, which helps tolerate load delays. The branch can then also be resolved earlier in the pipe, meaning a pipe flush discards fewer instructions.
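A rough C sketch of the two styles being compared, with hypothetical names (op_kind, apply_indirect, apply_branch). Bear in mind that an optimizing compiler is free to turn either form into the other, or inline everything, so this shows the source-level choice rather than guaranteed machine code.

```c
#include <assert.h>

enum op_kind { OP_ADD, OP_MUL };

static int do_add(int a, int b) { return a + b; }
static int do_mul(int a, int b) { return a * b; }

typedef int (*binop_fn)(int, int);

/* Table of function pointers, indexed by op_kind. */
static const binop_fn op_table[] = { do_add, do_mul };

/* Indirect call: the destination address is loaded from memory,
   so the CPU has to predict it or wait for the load. */
int apply_indirect(enum op_kind k, int a, int b)
{
    assert(k == OP_ADD || k == OP_MUL);  /* guard the table index */
    return op_table[k](a, b);
}

/* Conditional branch: both possible destinations are fixed in the
   code, and only the taken/not-taken direction must be predicted. */
int apply_branch(enum op_kind k, int a, int b)
{
    if (k == OP_ADD)
        return do_add(a, b);
    return do_mul(a, b);
}
```

Both functions compute the same result; the difference is only in what kind of branch the hardware has to predict.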
Modern speculative CPUs are very dynamic things, and a lot of that machinery is designed to hide those load delays by finding other work to do in the meantime. Sometimes the delay is a couple of clocks (a hit in the L1 cache), other times it is 100s of clocks (a fetch from DRAM). That means real-world performance measurement can be a bit mushy because there's so much going on at once.