| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by nkurz 3114 days ago

And yet, the combination screams.

Yes, I presume it's very fast because of a number of smart design decisions. I would guess that the relatively small on-disk size of executable is a consequence of these decisions, rather than a cause of the high speed. And as you point it, it's really the design of the core interpreter that matters.

When Python switched the interpreter loop from a switch to a threaded one, for example, they got ~20% speedup[0]; I wouldn't be surprised if the fitting entirely within the I-cache (which K did and Python didn't at the time) gives another 20% speedup.

I'm familiar with this improvement, and talk it up often. Since certain opcodes are more likely to follow other opcodes (even if they are globally rare) threaded dispatch can significantly reduce branch prediction errors. But despite not having measured the number of I-cache misses on the Python benchmarks, I'd be utterly astonished if there were enough of them to allow for a 20% speedup. My guess would be that the potential is something around 1%, but if you can prove that it's more than 10% I'd be excited to help you work on solving it.

1 comments

beagle3 3114 days ago

I am not involved with k, and things might have changed significantly, but around the 2003-2005 timeframe, Arthur had very conclusive benchmarks that showed I-cache residence makes a huge difference (IIRC I-cache was just 8KB those days ...).

The people who surely know what difference it makes today are Nial Dalton and Arthur Whitney.

link

nkurz 3114 days ago

around the 2003-2005 timeframe, Arthur had very conclusive benchmarks that showed I-cache residence makes a huge difference

That sounds quite plausible. The front-end of Intel processors (the parts that deal with making sure there is a queue of instructions ready to execute by the backend) has made some major advances since then. The biggest jumps were probably Nehalem in 2007, and then Sandy Bridge in 2009.

It's not that binary size no longer matters, but you almost have to go out of your way to make instruction cache misses be the tightest bottleneck on a hot path. And when it would be the bottleneck, the branch predictor and prefetch are so good that it's usually only a problem when combined with poor branch prediction, so it really only adds to the delay rather than causing it.

link