by ImHereToVote 827 days ago
I always wonder when hearing about these old optimizations why they aren't used in contemporary code. Wouldn't you want to squeeze every bit of performance even on modern hardware?
5 comments

The "processor-memory performance gap" is a big reason why lookup tables aren't as clear of a win on modern hardware as they were on the SNES.

If it takes two CPU cycles to read from RAM, a lookup table will basically always be faster than doing the math at runtime. If it takes fifty cycles (because, while your RAM may be faster, your CPU is a lot faster), and your processor has more advanced hardware that can do more math per cycle, maybe just doing the math is faster.
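To make the two options concrete, here is a minimal Python sketch of the same function done both ways. The 256-entry sine table and the names `sin_lookup` and `sin_computed` are made up for illustration. Python won't show the timing difference, since that depends on cache behaviour and the CPU's math throughput, so this only shows the shape of the trade-off.

```python
import math

# Assumed table size: one full turn split into 256 steps.
TABLE_SIZE = 256

# Precompute once; every later call is a single indexed read.
SINE_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]


def sin_lookup(angle_step: int) -> float:
    """Table approach: one memory read, indexed by angle in 1/256 turns."""
    return SINE_TABLE[angle_step % TABLE_SIZE]


def sin_computed(angle_step: int) -> float:
    """Compute approach: redo the math on every call, no table in memory."""
    return math.sin(2 * math.pi * (angle_step % TABLE_SIZE) / TABLE_SIZE)
```

On a machine where the read is cheap, the first version wins. If the table falls out of cache and the read costs tens of cycles, the second can come out ahead.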

I think this is the only answer that addresses the issue. We always underestimate the cost of a read and overestimate the cost of a compute.

Some of these old optimizations are now obsolete. For example, there’s a famous trick for inverse square root: https://en.wikipedia.org/wiki/Fast_inverse_square_root Modern processors have a special instruction for that. The hardware instruction is several times faster, and a couple of orders of magnitude more precise: https://www.felixcloutier.com/x86/rsqrtps
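For anyone who hasn't seen the trick, here is a rough Python sketch of the arithmetic, based on the widely published version with the 0x5F3759DF constant and one Newton-Raphson step. The function names are mine, and `struct` is only used to reinterpret the float32 bits. It shows what the trick computes, not how fast it runs.

```python
import math
import struct


def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for positive x using the classic bit trick."""
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Magic constant minus half the bit pattern gives a rough first guess.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines the guess.
    return y * (1.5 - 0.5 * x * y * y)


def relative_error(x: float) -> float:
    """Relative error of the approximation against math.sqrt."""
    exact = 1.0 / math.sqrt(x)
    return abs(fast_inv_sqrt(x) - exact) / exact
```

With a single Newton step the relative error stays below roughly 0.2% across positive inputs.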

Other optimizations are now applied automatically by compilers. For example, all modern compilers optimize integer division by compile-time constants. Here’s an example: https://godbolt.org/z/1b8r5c5MG
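As a sketch of the kind of rewrite involved (not necessarily what the linked example shows), an unsigned 32-bit division by 10 can be replaced by a multiply and a shift. The constant below is ceil(2^35 / 10), and `div10` is a name made up for this example. Compilers pick their own constants and instruction sequences depending on the divisor and the target.

```python
MAGIC = 0xCCCCCCCD  # ceil(2**35 / 10)
SHIFT = 35


def div10(x: int) -> int:
    """Divide an unsigned 32-bit integer by 10 with no division instruction."""
    if not 0 <= x < 2**32:
        raise ValueError("x must fit in an unsigned 32-bit integer")
    # The product fits in 64 bits; the shift keeps only the quotient.
    return (x * MAGIC) >> SHIFT
```

The result matches `x // 10` for every 32-bit unsigned input. This is what lets a compiler swap a slow divide for a much cheaper multiply.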

Squeezing performance out of modern hardware requires doing very different things.

Here’s an example about numerical computations. On paper, each core of my CPU can do 64 single-precision FLOPs each cycle. In reality, to achieve that performance a program needs to spam _mm256_fmadd_ps instructions while only loading at most 1 AVX vector per FMA, and only storing at most 1 AVX vector per two FMAs.

Artifacts are ugly. So why force them on modern hardware when GPUs are extremely fast?

For reference: I was doing a path tracer in PHP :) so yeah, that renders like ancient hardware.

(The browser requested different buckets of an image. A PHP script then rendered and returned that bucket. So it was a kind of multi-threading but still very slow.)

A lot of them have been made obsolete by new instructions, as with the infamous inverse square root trick.

Doing so costs time/wage dollars.