|
|
|
|
|
by beagle3
3116 days ago
|
|
> Instruction cache misses are rarely going to be a limiting factor. k's performance is a combination of a lot of small things, each one independently doesn't seem to be that meaningful. And yet, the combination screams. The main interpreter core, for example, used to be <16K code and fit entirely within the I-cache; that means bytecode dispatch was essentially never re-fetched or re-decoded to micro instructions, and all the speculative execution predictors have a super high hit rate. When Python switched the interpreter loop from a switch to a threaded one, for example, they got ~20% speedup[0]; I wouldn't be surprised if the fitting entirely within the I-cache (which K did and Python didn't at the time) gives another 20% speedup. [0] https://bugs.python.org/issue4753 |
|
Yes, I presume it's very fast because of a number of smart design decisions. I would guess that the relatively small on-disk size of executable is a consequence of these decisions, rather than a cause of the high speed. And as you point it, it's really the design of the core interpreter that matters.
When Python switched the interpreter loop from a switch to a threaded one, for example, they got ~20% speedup[0]; I wouldn't be surprised if the fitting entirely within the I-cache (which K did and Python didn't at the time) gives another 20% speedup.
I'm familiar with this improvement, and talk it up often. Since certain opcodes are more likely to follow other opcodes (even if they are globally rare) threaded dispatch can significantly reduce branch prediction errors. But despite not having measured the number of I-cache misses on the Python benchmarks, I'd be utterly astonished if there were enough of them to allow for a 20% speedup. My guess would be that the potential is something around 1%, but if you can prove that it's more than 10% I'd be excited to help you work on solving it.