Interesting. About 20 years ago, it must have been the other way around because I remember this paper [1] where the authors were able to speed up the log function by making use of a lookup table in the CPU cache.
I make things faster all the time by leveraging various CPU caches, sometimes even disk or networked disks. As a general principle though, memory lookups are substantially slower than CPU (and that has indeed changed over time; a decade or three ago they were close to equal), and even cache lookups are fairly comparatively slow, especially when you consider whole-program optimization.
That isn't to say that you can't speed things up with caches, but that you have to be replacing a lot of computations for even very small caches to be practically better (and even very small caches aren't helpful if the whole-program workload is such that you'll have to pull those caches from main RAM each time you use them).
To your paper in particular, their technique still assumes reasonably small caches which you constantly access (so that you never have to reach out to main RAM), even when it was written, and part of what makes it faster is that it's nowhere near as accurate as 1ULP.
Logarithms are interesting because especially across their entire domain they can take 40-120 cycles to compute, more if you're not very careful with the implementation. Modern computers have fairly fast floating-point division and fused multiply-add, so something I often do nowadays is represent them as a ratio of two quadratics (usually rescaling the other math around the problem to avoid the leading coefficient on one of those quadratics) to achieve bounded error in my domain of interest. It's much faster than a LUT (especially when embedded in a larger computation and not easily otherwise parallelizable) and much faster than full-precision solutions. It's also pretty trivially vectorizable in case your problem is amenable to small batches. Other characteristics of your problem might cause you to favor other solutions.
Logarithms are interesting because there's hardware to approximate them built into every modern processor as part of floating point. If you can accept the error, you can abuse it to compute logs with a single FMA.
An example of an exp and a log respectively from my personal library of bit hacks:
It’s a delicate balance and really hard to benchmark. You can write a micro benchmark that keeps the lookup table in cache but what if your function isn’t the only thing being done in a loop? Then even if it’s in the hotpath, there’s insufficient cache to keep the table loaded the entire way through the loop and lookup is slower.
TLDR: it depends on the usage and we actually should have multiple functions that are specialized based on the properties of the caller’s needs where the caller can try a cache or compute approach.
I make things faster all the time by leveraging various CPU caches, sometimes even disk or networked disks. As a general principle though, memory lookups are substantially slower than CPU (and that has indeed changed over time; a decade or three ago they were close to equal), and even cache lookups are fairly comparatively slow, especially when you consider whole-program optimization.
That isn't to say that you can't speed things up with caches, but that you have to be replacing a lot of computations for even very small caches to be practically better (and even very small caches aren't helpful if the whole-program workload is such that you'll have to pull those caches from main RAM each time you use them).
To your paper in particular, their technique still assumes reasonably small caches which you constantly access (so that you never have to reach out to main RAM), even when it was written, and part of what makes it faster is that it's nowhere near as accurate as 1ULP.
Logarithms are interesting because especially across their entire domain they can take 40-120 cycles to compute, more if you're not very careful with the implementation. Modern computers have fairly fast floating-point division and fused multiply-add, so something I often do nowadays is represent them as a ratio of two quadratics (usually rescaling the other math around the problem to avoid the leading coefficient on one of those quadratics) to achieve bounded error in my domain of interest. It's much faster than a LUT (especially when embedded in a larger computation and not easily otherwise parallelizable) and much faster than full-precision solutions. It's also pretty trivially vectorizable in case your problem is amenable to small batches. Other characteristics of your problem might cause you to favor other solutions.