Hacker News new | ask | show | jobs
by yalok 2685 days ago
Can someone please explain why loop tiling increases performance in JS so dramatically? Is it mainly due to the fact that inner loops have constant size (64) and get called more frequently, and thus get promoted faster into deeper stages of JS runtime optimization?

My guess is that if you try to invoke initial whole code (before tiling) in a external loop (rotating images of exactly the same size), you will get similar perf boost (not that it has practical implication, but just to understand how optimization works).

1 comments

No, it's faster because the working set of 64 * 64 * 4 * 2 bytes can (almost) fit in CPU core L1 cache. Further cache levels are slower and finally the memory is glacially slow.

WASM example would speed up as well using the same approach. Or C, Rust or whatever.

To add background, this is a standard optimization technique that has been employed in eg fortran compilers since at least the 1980s.
Doesn't this rely on the CPU prefetching the memory to cache? Do current CPUs from Intel&AMD detect access patterns like this successfully? I.e. where you're accessing 64-element slices from a bigger array with a specific stride.
The idea is that the Y dimension is going to have a limited nr (here 64) of hot cache lines while a tile is processed. After going through one set of 64 vertical lines, the Y accesses are going to be near the Y accesses from the previous outer-tile-loop iteration.

(Stride detecting prefetch can help, especially on the first iteration of a tile, but is not required for a speedup).

BTW this is the motivation for GPUs (and sometimes other graphics applications) using "swizzled" texture/image formats, where pixels are organised into various kinds of screen-locality preserving clumps. https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-...