| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by hansvm 96 days ago

It........depends.

I make things faster all the time by leveraging various CPU caches, sometimes even disk or networked disks. As a general principle though, memory lookups are substantially slower than CPU (and that has indeed changed over time; a decade or three ago they were close to equal), and even cache lookups are fairly comparatively slow, especially when you consider whole-program optimization.

That isn't to say that you can't speed things up with caches, but that you have to be replacing a lot of computations for even very small caches to be practically better (and even very small caches aren't helpful if the whole-program workload is such that you'll have to pull those caches from main RAM each time you use them).

To your paper in particular, their technique still assumes reasonably small caches which you constantly access (so that you never have to reach out to main RAM), even when it was written, and part of what makes it faster is that it's nowhere near as accurate as 1ULP.

Logarithms are interesting because especially across their entire domain they can take 40-120 cycles to compute, more if you're not very careful with the implementation. Modern computers have fairly fast floating-point division and fused multiply-add, so something I often do nowadays is represent them as a ratio of two quadratics (usually rescaling the other math around the problem to avoid the leading coefficient on one of those quadratics) to achieve bounded error in my domain of interest. It's much faster than a LUT (especially when embedded in a larger computation and not easily otherwise parallelizable) and much faster than full-precision solutions. It's also pretty trivially vectorizable in case your problem is amenable to small batches. Other characteristics of your problem might cause you to favor other solutions.

1 comments

AlotOfReading 96 days ago

Logarithms are interesting because there's hardware to approximate them built into every modern processor as part of floating point. If you can accept the error, you can abuse it to compute logs with a single FMA.

An example of an exp and a log respectively from my personal library of bit hacks:

    bit_cast<float>((int32_t)(fma(12102203.2f, x, 0x3f800000)));

    bit_cast<float>((uint32_t)(-0x3f800000 - 36707.375f*x)) + 7;

link