| HN Mirror



	by fulafel 1612 days ago
	There's a presumption here that checked access would cost some number of nanoseconds per access, but this often isn't the case, since predicted, not-taken branches tend to have 0 cycle latency in recent CPUs.

2 comments

celeritascelery 1612 days ago

> predicted, not-taken branches tend to have 0 cycle latency in recent CPUs.

This is not the case. Due to instruction-level parallelism, the throughput could be unaffected, but you will always have a latency penalty. The CPU still needs to run the check (load the length and compare it to the index), and this adds latency. On top of that, it also increases code size, which can impact the instruction cache and binary size. It’s a small penalty, but it’s not 0.
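
To make the "access the length and compare it to the index" step concrete, here is a rough sketch in Rust of what a checked access expands to next to an unchecked one. The names `checked_get` and `unchecked_get` are hypothetical and exist only for this illustration; the code a real compiler emits will differ in detail.

```rust
/// Roughly what a checked `v[i]` does: read the length, compare it with the
/// index, and only then load the element. (Hypothetical helper for illustration.)
pub fn checked_get(v: &[u64], i: usize) -> u64 {
    if i < v.len() {
        // SAFETY: `i < v.len()` was verified on the line above.
        unsafe { *v.get_unchecked(i) }
    } else {
        // Cold path: this extra code is part of the size cost mentioned above.
        panic!("index out of bounds: the len is {} but the index is {}", v.len(), i);
    }
}

/// The same load without the length read, the compare, or the branch.
///
/// # Safety
/// The caller must guarantee `i < v.len()`.
pub unsafe fn unchecked_get(v: &[u64], i: usize) -> u64 {
    // SAFETY: guaranteed by the caller, see the function contract.
    unsafe { *v.get_unchecked(i) }
}
```

The checked version has one extra load (the length, if it is not already in a register), one compare, one branch, and a cold panic path. Whether those extra instructions show up as latency depends on the surrounding code and the CPU.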


fulafel 1612 days ago

Speculative execution lets the CPU continue along the predicted branch without stopping. You do need the ~2 instructions that fetch the inputs for the length test, but that cost can usually be absorbed by instruction-level parallelism without hurting the latency of the array operation.


errantmind 1611 days ago

Do you have any evidence of this claim? Perhaps a benchmark?

This doesn't align with any of my performance optimization experience.
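
One way to gather that kind of evidence is a small micro-benchmark that runs the same loop with and without the check. The sketch below is only a starting point, not a rigorous benchmark: `sum_checked`, `sum_unchecked`, `make_indices`, and `time_it` are hypothetical names, the index generator is a plain LCG chosen to avoid external crates, and a single timed run like this is noisy.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Sums `data[i]` for every index, with the usual bounds check on each access.
pub fn sum_checked(data: &[u64], indices: &[usize]) -> u64 {
    let mut total = 0u64;
    for &i in indices {
        total = total.wrapping_add(data[i]);
    }
    total
}

/// Same loop without the per-access check.
///
/// # Safety
/// Every value in `indices` must be less than `data.len()`.
pub unsafe fn sum_unchecked(data: &[u64], indices: &[usize]) -> u64 {
    let mut total = 0u64;
    for &i in indices {
        // SAFETY: guaranteed by the caller, see the function contract.
        total = total.wrapping_add(unsafe { *data.get_unchecked(i) });
    }
    total
}

/// Deterministic pseudo-random indices in `0..len` (simple LCG; `len` must be > 0).
pub fn make_indices(count: usize, len: usize) -> Vec<usize> {
    let mut state = 0x2545_F491_4F6C_DD1Du64;
    (0..count)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            ((state >> 33) as usize) % len
        })
        .collect()
}

/// Runs `f` once and returns its result together with the elapsed wall time.
/// `black_box` keeps the optimizer from discarding the computed result.
pub fn time_it<F: FnMut() -> u64>(mut f: F) -> (u64, Duration) {
    let start = Instant::now();
    let result = black_box(f());
    (result, start.elapsed())
}
```

For numbers worth trusting, build with optimizations, repeat each measurement many times, and inspect the generated assembly to confirm the check was not optimized away in the "checked" version.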


fulafel 1601 days ago

I went looking, and it seems I have to walk my claim back somewhat. Wide-issue out-of-order (OoO) processors hide a lot of the overhead, but not all of it.


Jensson 1612 days ago

It depends. If the function is vectorizable, the CPU can process more elements at a time when it doesn't have to do the branch prediction work. Your claim is true for non-vectorizable work.


fulafel 1612 days ago

In autovectorized loops, the generated code typically needs length checks (or static length proofs) to handle the tails of vectors. But yes, there are still cases where the cost can be measurable.
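
As an illustration of the "length checks or static length proofs" point, here is a minimal Rust sketch of the same element-wise loop written three ways. The function names are hypothetical, and whether the optimizer actually drops the per-iteration checks or vectorizes any of these depends on the compiler version and flags, so treat it as something to verify in the generated assembly.

```rust
/// Indexed loop: `dst[i]` and `src[i]` are each checked per iteration unless
/// the optimizer can prove on its own that `i` is in range for both slices.
pub fn add_indexed(dst: &mut [u32], src: &[u32], n: usize) {
    for i in 0..n {
        dst[i] = dst[i].wrapping_add(src[i]);
    }
}

/// Hoisted check: re-slicing to exactly `n` elements performs one length check
/// per slice up front (panicking if `n` is too large), which gives the
/// compiler the proof that `i < n` is always in bounds inside the loop.
pub fn add_hoisted(dst: &mut [u32], src: &[u32], n: usize) {
    let dst = &mut dst[..n];
    let src = &src[..n];
    for i in 0..n {
        dst[i] = dst[i].wrapping_add(src[i]);
    }
}

/// Iterator form: no indexing at all; the loop stops at the shorter slice.
pub fn add_zipped(dst: &mut [u32], src: &[u32]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d = (*d).wrapping_add(*s);
    }
}
```

All three compute the same result for in-range `n`; the difference is only in where the length comparison happens, once before the loop or potentially on every iteration.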
