by nkurz
3114 days ago
1) Yes, but that's not huge by modern standards. OP could have phrased it better, but I presume his point was that 500KB is extremely small by modern standards. The whole executable fits comfortably in L3, so an instruction fetch will probably never miss all the way out to main memory. On the other hand, while it's cool that it's small, I'm not sure that binary size is a good proxy for performance. Instruction cache misses are rarely going to be a limiting factor.
|
K's performance comes from a combination of a lot of small things, none of which seems that meaningful on its own. And yet, the combination screams.
The main interpreter core, for example, used to be under 16KB of code and fit entirely within the I-cache; that means the bytecode dispatch code essentially never had to be re-fetched or re-decoded into micro-ops, and all the speculative execution predictors had a super high hit rate.
When Python switched the interpreter loop from a switch to a threaded one, for example, they got a ~20% speedup[0]; I wouldn't be surprised if fitting entirely within the I-cache (which K did and Python didn't at the time) gives another 20% speedup.
[0] https://bugs.python.org/issue4753
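
For anyone who hasn't seen the two dispatch styles side by side, here is a minimal sketch in C. It assumes GCC or Clang, since the `&&label` and `goto *` syntax is a compiler extension rather than standard C. The opcode set and the names `run_switch` and `run_threaded` are made up for illustration; this is not K's or CPython's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy bytecode, invented for this sketch. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };
enum { STACK_MAX = 64 };

/* Style 1: a switch inside a loop. Every opcode is dispatched through
   the same indirect jump at the top of the switch. */
long run_switch(const int *code) {
    long stack[STACK_MAX];
    size_t sp = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH: assert(sp < STACK_MAX); stack[sp++] = *code++; break;
        case OP_ADD:  assert(sp >= 2); sp--; stack[sp - 1] += stack[sp]; break;
        case OP_MUL:  assert(sp >= 2); sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_HALT: return sp ? stack[sp - 1] : 0;
        default:      return -1; /* unknown opcode */
        }
    }
}

/* Style 2: threaded dispatch via "labels as values" (GCC/Clang extension).
   Each handler ends with its own indirect jump, so the branch predictor
   can track "what usually follows this opcode" separately per handler.
   Note: this sketch does not validate opcodes before indexing the table. */
long run_threaded(const int *code) {
    /* Table order must match the enum values above. */
    static void *const targets[] = { &&do_push, &&do_add, &&do_mul, &&do_halt };
    long stack[STACK_MAX];
    size_t sp = 0;

#define DISPATCH() goto *targets[*code++]
    DISPATCH();

do_push: assert(sp < STACK_MAX); stack[sp++] = *code++; DISPATCH();
do_add:  assert(sp >= 2); sp--; stack[sp - 1] += stack[sp]; DISPATCH();
do_mul:  assert(sp >= 2); sp--; stack[sp - 1] *= stack[sp]; DISPATCH();
do_halt: return sp ? stack[sp - 1] : 0;
#undef DISPATCH
}
```

Both functions should compute the same result for the same program, e.g. `{ OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT }` evaluates (2 + 3) * 4. Whether the threaded version is actually faster depends on the compiler and CPU, so treat any speedup as something to measure rather than assume.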