I always wonder when hearing about these old optimizations why they aren't used in contemporary code. Wouldn't you want to squeeze every bit of performance even on modern hardware?
The "processor-memory performance gap" is a big reason why lookup tables aren't as clear of a win on modern hardware as they were on the SNES.
If it takes two CPU cycles to read from RAM, a lookup table will basically always be faster than doing the math at runtime. If it takes fifty cycles (because, while your RAM may be faster, your CPU is a lot faster), and your processor has more advanced hardware that can do more math per cycle, maybe just doing the math is faster.
Other optimizations are now applied automatically by compilers. For example, all modern compilers optimize integer division by compile-time constants, here’s an example: https://godbolt.org/z/1b8r5c5MG
Squeezing performance out of modern hardware requires doing very different things.
Here’s an example about numerical computations. On paper, each core of my CPU can do 64 single-precision FLOPs each cycle. In reality, to achieve that performance a program needs to spam _mm256_fmadd_ps instructions while only loading at most 1 AVX vector per FMA, and only storing at most 1 AVX vector per two FMAs.
Artifacts are ugly. So why force it on modern hardware when GPUs are extremely fast?
For reference: I was doing a path tracer in PHP :) so yeah, that renders like ancient hardware.
(The browser requested different buckets of an image. A PHP script then rendered and returned that bucket. So it was a kind of multi-threading but still very slow.)
If it takes two CPU cycles to read from RAM, a lookup table will basically always be faster than doing the math at runtime. If it takes fifty cycles (because, while your RAM may be faster, your CPU is a lot faster), and your processor has more advanced hardware that can do more math per cycle, maybe just doing the math is faster.