
by alexhutcheson 672 days ago

If you are accessing elements in memory with a constant stride[1], then hardware prefetchers[2] do a surprisingly good job at “reading ahead” and avoiding cache misses.

A typical example would be: you have an array of objects of constant size, and you’re reading a double field from a constant offset within each object. The hardware prefetcher will “recognize” this access pattern and prefetch that offset every sizeof(obj) bytes.
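To make that access pattern concrete, here is a minimal sketch in C. The `Particle` struct, its field names, and its sizes are hypothetical and chosen only so that each object is typically 64 bytes on common 64-bit platforms; nothing here comes from a specific codebase.

```c
#include <stddef.h>

/* Hypothetical record layout, for illustration only. */
typedef struct {
    long   id;
    double value;        /* the one field the loop actually reads */
    char   payload[48];  /* other data that rides along in the cache line */
} Particle;

/* Sum `value` across an array of structs. Successive loads are
   sizeof(Particle) bytes apart, i.e. a constant stride, which is the
   kind of pattern a stride prefetcher can usually detect. */
double sum_values(const Particle *items, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        total += items[i].value;
    }
    return total;
}
```

Whether the prefetcher actually keeps up with a loop like this depends on the core, so treat it as something to measure rather than assume.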

The major downsides (vs. a struct-of-arrays design with full spatial locality) are:

1. Every prefetch pulls a full cache line, but most of that line is data you don’t need. In this example, each cache line holds 64 bytes (a typical size), but you only use the one double field (8 bytes), so the other 56 bytes cost memory bandwidth and cache capacity for nothing. If you were iterating over a contiguous array of doubles instead, each cache line would deliver 8 useful values.

2. Specific performance is hardware-dependent, so it’s hard to guarantee performance on e.g. low-end cores, short loops, or unusually long strides.
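The arithmetic behind point 1 can be sketched as follows. This is a rough model, not a measurement: it assumes 64-byte cache lines, a line-aligned base address, and a field that never straddles two lines. `CACHE_LINE_BYTES`, `lines_touched`, and `sum_contiguous` are names made up for this example.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64  /* assumed line size; varies by hardware */

/* Estimate how many cache lines are touched when reading one field from
   each of n elements spaced `stride` bytes apart (aligned base assumed). */
size_t lines_touched(size_t n, size_t stride) {
    if (n == 0) {
        return 0;
    }
    if (stride >= CACHE_LINE_BYTES) {
        return n;  /* every element lands in its own line */
    }
    /* Offset of the last field read, converted to a line index. */
    return ((n - 1) * stride) / CACHE_LINE_BYTES + 1;
}

/* Struct-of-arrays counterpart: the field lives in its own dense array,
   so every byte of every fetched line is useful. */
double sum_contiguous(const double *values, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        total += values[i];
    }
    return total;
}
```

Under these assumptions, reading 1024 doubles at a 64-byte stride touches 1024 lines, while reading them from a dense array (8-byte stride) touches 128, which is the 8x difference described in point 1.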

[1] https://en.wikipedia.org/wiki/Stride_of_an_array

[2] https://en.wikipedia.org/wiki/Cache_prefetching