Hacker News new | ask | show | jobs
by twoodfin 4707 days ago
The article makes the point that the lookup table version is fast because the table fits in D$, and that if the table were evicted it would be slower. This is true, but the more interesting point is that by loading this table into D$, you're potentially slowing down other operations.

It's an important conundrum of optimization that if you had 20 similarly complex functions in a critical path, implementing and benchmarking each individually with a lookup table could show excellent performance while globally performance is terrible. And worse, it's uniformly terrible, with no particular function seeming to be consuming an inordinate amount of the runtime or, for that matter, D$ misses.

1 comments

If you had 20 similar functions, the tables would occupy 5k in total, using only 1/6th of the L1 D$ on a typical "big" CPU. In actuality, temporal locality is such that you don't often stride through all table entries uniformly, so the actual cache pressure is even less.

The point that you're going after is a good one, but its important to keep in mind how enormous modern memory hierarchies are. It often is very reasonable to trade memory and cache pressure for speed.