There's a presumption here that checked access would cost some number of nanoseconds per access, but this often isn't the case, since predicted, not-taken branches tend to have 0-cycle latency in recent CPUs.
> predicted, not-taken branches tend to have 0 cycle latency in recent CPUs.
This is not the case. Due to instruction-level parallelism, the throughput could be unaffected, but you will always have a latency penalty. The CPU still needs to run the check (access the length and compare it to the index), and this adds latency. On top of that, it also increases code size, which can impact the instruction cache and binary size. It's a small penalty, but it's not 0.
Speculative execution lets the CPU continue along the predicted branch without stopping. You do need the ~2 instructions that get the length-test inputs on hand, but that cost can usually be absorbed by instruction-level parallelism without hurting the latency of the array operation.
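To make the "~2 instructions" concrete, here is a rough Rust sketch of what a bounds-checked `data[i]` amounts to. `checked_get` is a hypothetical helper written for illustration, not what any particular compiler emits, and the comments about scheduling describe typical out-of-order behavior rather than a guarantee.

```rust
/// Hypothetical helper that spells out the steps of a checked `data[i]`.
pub fn checked_get(data: &[u32], i: usize) -> u32 {
    // Step 1: get the length. For a slice it travels with the pointer,
    // so it is often already in a register.
    let len = data.len();
    // Step 2: compare and branch. The failure path is almost never taken,
    // so the branch is easy to predict.
    if i >= len {
        panic!("index {} out of bounds for length {}", i, len);
    }
    // Step 3: the load. It does not consume the result of the compare, so an
    // out-of-order core can typically start it before the branch resolves.
    // SAFETY: `i < len` was checked above.
    unsafe { *data.get_unchecked(i) }
}
```

The load in step 3 has no data dependency on the compare, which is the basis for the claim that the check can overlap with the access. The compare and branch still occupy issue slots and code bytes, which is the other side of the argument.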
It depends. If the function is vectorizable, the CPU can process more elements at a time when it doesn't have to do the branch-prediction work for each access. Your point does hold for non-vectorizable work.
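A sketch of the difference in Rust, with two hypothetical functions that compute the same thing. Whether the optimizer actually removes the per-element checks or vectorizes either loop depends on the compiler version and flags, so treat this as an illustration of the idea, not a benchmark.

```rust
/// Indexes three slices in the loop. `out[i]` is provably in range, but
/// `a[i]` and `b[i]` each carry a check unless the compiler can prove
/// them redundant.
pub fn add_indexed(a: &[u32], b: &[u32], out: &mut [u32]) {
    for i in 0..out.len() {
        out[i] = a[i].wrapping_add(b[i]);
    }
}

/// Checks the lengths once up front. With the lengths known to be equal,
/// the optimizer has a much easier time dropping the per-element checks.
pub fn add_asserted(a: &[u32], b: &[u32], out: &mut [u32]) {
    assert!(a.len() == out.len() && b.len() == out.len());
    for i in 0..out.len() {
        out[i] = a[i].wrapping_add(b[i]);
    }
}
```

Both versions panic on mismatched lengths; the second just does so before the loop starts instead of partway through it.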
In autovectorized loops, the generated code typically needs length checks (or static length proofs) anyway to handle the tails of vectors. But yes, there are still cases where the cost can be measurable.
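For illustration, the body/tail split can be written by hand in Rust. This is a minimal sketch, assuming a lane width of 4 chosen arbitrarily; `sum_chunked` is a hypothetical function, and an autovectorizer would pick its own width and code shape.

```rust
/// Sums a slice in fixed-width chunks plus a scalar tail.
/// The length is consulted once to split the input, not once per element.
pub fn sum_chunked(data: &[u32]) -> u32 {
    let mut lanes = [0u32; 4];
    let chunks = data.chunks_exact(4);
    // Elements left over after the last full chunk (0 to 3 of them).
    let tail = chunks.remainder();
    for c in chunks {
        // Each chunk is known to have exactly 4 elements.
        for k in 0..4 {
            lanes[k] = lanes[k].wrapping_add(c[k]);
        }
    }
    // Combine the lane accumulators, then handle the tail one element at a time.
    let mut total = lanes.iter().fold(0u32, |acc, &x| acc.wrapping_add(x));
    for &x in tail {
        total = total.wrapping_add(x);
    }
    total
}
```

The length logic does not disappear here. It moves out of the hot loop and into the split between full chunks and the remainder.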