by jandrewrogers
1386 days ago
CPUs have a strong performance bias toward sequential memory access, and there are large threshold effects at work here. The block size used is not arbitrary. Improvements in prefetching and cache line utilization can yield performance benefits large enough to more than justify any apparent increase in computational cost from organizing the code to obtain them. Most developers do not have a good intuition for the efficiency and scalability of sequential brute force on modern architectures. There is a cross-over threshold, but I think many people would be surprised at how high it is in practice.
I wonder if it would be worth the trouble to code-gen a bunch of variants with things like 8-bit entries, and benchmark to death to determine the optimal cutover points...
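
For anyone who wants to try this, here is a rough sketch of what such a benchmark could look like. It is not anyone's actual code from this thread: it assumes the comparison of interest is a branch-free sequential scan versus `std::lower_bound` over a small sorted array, and every name in it (`linear_lower_bound`, `binary_lower_bound`, `make_sorted_keys`, `ns_per_lookup`, `print_crossover_table`) is made up for illustration. The entry type is a template parameter, so 8-bit and wider variants come from the same source. The sizes tested are arbitrary and stop at 256 so that the 8-bit keys stay distinct.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Sequential brute force: count the keys smaller than the needle.
// On a sorted array that count equals the lower-bound index, and the
// loop has no data-dependent branch, so it is easy to vectorize.
template <typename T>
std::size_t linear_lower_bound(const std::vector<T>& keys, T needle) {
    std::size_t idx = 0;
    for (T k : keys) idx += (k < needle);
    return idx;
}

// Baseline: the standard library's binary search.
template <typename T>
std::size_t binary_lower_bound(const std::vector<T>& keys, T needle) {
    return static_cast<std::size_t>(
        std::lower_bound(keys.begin(), keys.end(), needle) - keys.begin());
}

// Sorted, distinct keys 0..n-1 (assumes n - 1 fits in T).
template <typename T>
std::vector<T> make_sorted_keys(std::size_t n) {
    std::vector<T> keys(n);
    for (std::size_t i = 0; i < n; ++i) keys[i] = static_cast<T>(i);
    return keys;
}

// Average wall-clock nanoseconds per call of lookup(needle_index).
// Needle indices are scrambled with a multiplicative hash; the pattern
// is still periodic, so a serious benchmark should use real random data.
template <typename F>
double ns_per_lookup(F&& lookup, std::size_t n, std::size_t reps) {
    volatile std::size_t sink = 0;  // keeps the calls from being optimized out
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t q = 0; q < reps; ++q) {
        sink = sink + lookup((q * 2654435761ULL) % n);
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           static_cast<double>(reps);
}

// Print one row per array size so the cutover point can be read off.
template <typename T>
void print_crossover_table(const char* label, std::size_t reps) {
    const std::size_t sizes[] = {8, 16, 32, 64, 128, 256};
    for (std::size_t n : sizes) {
        const std::vector<T> keys = make_sorted_keys<T>(n);
        const double lin = ns_per_lookup(
            [&](std::size_t q) { return linear_lower_bound(keys, static_cast<T>(q)); },
            n, reps);
        const double bin = ns_per_lookup(
            [&](std::size_t q) { return binary_lower_bound(keys, static_cast<T>(q)); },
            n, reps);
        std::printf("%s n=%zu linear=%.2f ns binary=%.2f ns\n", label, n, lin, bin);
    }
}
```

Calling `print_crossover_table<std::uint8_t>("u8", 1000000)` and `print_crossover_table<std::uint32_t>("u32", 1000000)` from `main` prints a table per entry type. The numbers will depend heavily on compiler flags, the CPU, and whether the array is already in cache, so treat any cutover point it suggests as specific to that machine.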