|
|
|
|
|
by nkurz
3114 days ago
|
|
And yet, the combination screams. Yes, I presume it's very fast because of a number of smart design decisions. I would guess that the relatively small on-disk size of executable is a consequence of these decisions, rather than a cause of the high speed. And as you point it, it's really the design of the core interpreter that matters. When Python switched the interpreter loop from a switch to a threaded one, for example, they got ~20% speedup[0]; I wouldn't be surprised if the fitting entirely within the I-cache (which K did and Python didn't at the time) gives another 20% speedup. I'm familiar with this improvement, and talk it up often. Since certain opcodes are more likely to follow other opcodes (even if they are globally rare) threaded dispatch can significantly reduce branch prediction errors. But despite not having measured the number of I-cache misses on the Python benchmarks, I'd be utterly astonished if there were enough of them to allow for a 20% speedup. My guess would be that the potential is something around 1%, but if you can prove that it's more than 10% I'd be excited to help you work on solving it. |
|
The people who surely know what difference it makes today are Nial Dalton and Arthur Whitney.