Hacker News new | ask | show | jobs
by jandrewrogers 1386 days ago
CPUs have a strong performance bias toward sequential memory access and there are large threshold effects at work here. The block size used is not arbitrary. Improvements in prefetching and cache line utilization can have such large performance benefits that it more than justifies any apparent increase in computational cost because of how the code is organized to obtain those improvements.

Most developers do not have a good intuition for the efficiency and scalability of sequential brute-force on modern architectures. There is a cross-over threshold but I think many people would be surprised at how high it is in practice.

1 comments

I would have expected 2048 items to be the optimal cutover point, because with 16-bit entries that would result in 4KB arrays. Those fit exactly into a typical CPU memory page, whereas 8KB requires two. In the worst case that might double the page table overheads, e.g.: for random point reads.

I wonder if it would be worth the trouble to code-gen a bunch of variants with things like 8-bit entries, and benchmark to death to determine the optimal cutover points...

I think if you go down this road, you'll start to see differences depending on the brand and model of CPU you use. Essentially, you're going into the territory of FFTW's "planner" scheme.