Hacker News new | ask | show | jobs
by xyzzyz 89 days ago
On modern machines, looking things up can be slower than recomputing it, when the computation is simple. This is because the memory is much slower than the CPU, which means you can often compute something many times over before the answer from memory arrives.
4 comments

Not just modern machines, the Nintendo64 was memory bound under most circumstances and as such many traditional optimizations (lookup tables, unrolling loops) can be slower on the N64. The unrolling loops case is interesting. Because the cpu has to fetch more instructions this puts more strain on the memory bus.

If curious, On a N64 the graphics chip is also the memory controller so every thing the cpu can do to stay off the memory bus has an additive effect allowing the graphics to do more graphics. This is also why the n64 has weird 9-bit ram, it is so they could use a 18-bit pixel format, only taking two bytes per pixel, for cpu requests the memory controller ignored the 9th bit, presenting a normal 8 bit byte.

They were hoping that by having high speed memory, 250 mHz, the cpu ran at 90mHz, it could provide for everyone and it did ok, there are some very impressive games on the n64. but on most of them the cpu is running fairly light, gotta stay off that memory bus.

https://www.youtube.com/watch?v=xFKFoGiGlXQ (Kaze Emanuar: Finding the BEST sine function for Nintendo 64)

> This is also why the n64 has weird 9-bit ram, it is so they could use a 18-bit pixel format, only taking two bytes per pixel, for cpu requests the memory controller ignored the 9th bit, presenting a normal 8 bit byte.

The Ensoniq EPS sampler (the first version) used 13-bit RAM for sample memory. Why 13 and not 12? Who knows? Possibly because they wanted it "one louder", possibly because the Big Rival in the E-Mu Emulator series used μ-law codecs which have the same effective dynamic range as 13-bit linear.

Anyway you read a normal 16-bit word using the 68000's normal 16-bit instructions but only the upper 13 were actually valid data for the RAM, the rest were tied low. Haha, no code space for you!

The N64 was a particularly unbalanced design for its era so nobody was used to writing code like that yet. Memory bandwidth wasn't a limitation on previous consoles so it's like nobody thought of it.
Unless your lookup table is small enough to only use a portion of your L1 cache and you're calling it so much that the lookup table is never evicted :)
Even that is not necessarily needed, I have gotten major speedups from LUTs even as large as 1MB because the lookup distribution was not uniform. Modern CPUs have high cache associativity and faster transfers between L1 and L2.

L1D caches have also gotten bigger -- as big as 128KB. A Deflate/zlib implementation, for instance, can use a brute force full 32K entry LUT for the 15-bit Huffman decoding on some chips, no longer needing the fast small table.

It's still less space for other things in the L1 cache, isn't it?
Interesting. About 20 years ago, it must have been the other way around because I remember this paper [1] where the authors were able to speed up the log function by making use of a lookup table in the CPU cache.

[1] https://www.researchgate.net/profile/Nikki-Mirghafori/public...

It........depends.

I make things faster all the time by leveraging various CPU caches, sometimes even disk or networked disks. As a general principle though, memory lookups are substantially slower than CPU (and that has indeed changed over time; a decade or three ago they were close to equal), and even cache lookups are fairly comparatively slow, especially when you consider whole-program optimization.

That isn't to say that you can't speed things up with caches, but that you have to be replacing a lot of computations for even very small caches to be practically better (and even very small caches aren't helpful if the whole-program workload is such that you'll have to pull those caches from main RAM each time you use them).

To your paper in particular, their technique still assumes reasonably small caches which you constantly access (so that you never have to reach out to main RAM), even when it was written, and part of what makes it faster is that it's nowhere near as accurate as 1ULP.

Logarithms are interesting because especially across their entire domain they can take 40-120 cycles to compute, more if you're not very careful with the implementation. Modern computers have fairly fast floating-point division and fused multiply-add, so something I often do nowadays is represent them as a ratio of two quadratics (usually rescaling the other math around the problem to avoid the leading coefficient on one of those quadratics) to achieve bounded error in my domain of interest. It's much faster than a LUT (especially when embedded in a larger computation and not easily otherwise parallelizable) and much faster than full-precision solutions. It's also pretty trivially vectorizable in case your problem is amenable to small batches. Other characteristics of your problem might cause you to favor other solutions.

Logarithms are interesting because there's hardware to approximate them built into every modern processor as part of floating point. If you can accept the error, you can abuse it to compute logs with a single FMA.

An example of an exp and a log respectively from my personal library of bit hacks:

    bit_cast<float>((int32_t)(fma(12102203.2f, x, 0x3f800000)));

    bit_cast<float>((uint32_t)(-0x3f800000 - 36707.375f*x)) + 7;
It’s a delicate balance and really hard to benchmark. You can write a micro benchmark that keeps the lookup table in cache but what if your function isn’t the only thing being done in a loop? Then even if it’s in the hotpath, there’s insufficient cache to keep the table loaded the entire way through the loop and lookup is slower.

TLDR: it depends on the usage and we actually should have multiple functions that are specialized based on the properties of the caller’s needs where the caller can try a cache or compute approach.

It may be, especially when it comes to unnecessary cache. But I think `atan` is almost a brute force. Lookup is nothing comparing to that.

Sin/cos must be borders of sqrt(x²+y²). It is also cached indeed.

What do you mean brute force?

We can compute these things using iteration or polynomial approximations (sufficient for 64 bit).

There is a loop of is it close enough or not something like that. It is a brute force. Atan2 purely looks like that to me.
It’s not just guessing, there is theory to prove convergence and rates of converge.

Many algorithms require iteration.

I don't think these functions are programmed as it looks like they are in their Math form. Atan2 is something like a line-to rotating, if the given point is close to any pixel of line-to, returns how many times the line to is rotated. It is almost a motion but an algorithm. This is why I'm telling it is a brute force.
> Sin/cos must be borders of sqrt(x²+y²). It is also cached indeed

This doesn't make a ton of sense.

In what way do you think a sin function is computed? It is something that computed and cached in my opinion.

I think it is stored like sintable[deg]. The degree is index.

> In what way do you think a sin function is computed?

In some way vaguely like this: https://github.com/jeremybarnes/cephes/blob/master/cmath/sin...

> I think it is stored like sintable[deg]. The degree is index.

I can think of a few reasons why this is a bad idea.

1. Why would you use degrees? Pretty much everybody uses and wants radians.

2. What are you going to do about fractional degrees? Some sort of interpretation, right?

3. There's only so much cache available, are you willing to spend multiple kilobytes of it every time you want to calculate a sine? If you're imagining doing this in hardware, there are only so many transistors available, are you willing to spend that many thousands of them?

4. If you're keeping a sine table, why not keep one half the size, and then add a cosine table of equal size. That way you can use double and sum angle formulae to get the original range back and pick up cosine along the way. Reflection formulae let you cut it down even further.

There's a certain train of thought that leads from (2).

a. I'm going to be interpreting values anyway

b. How few support points can I get away with?

c. Are there better choices than evenly spaced points?

d. Wait, do I want to limit myself to polynomials?

Following it you get answers "b: just a handful" and "c: oh yeah!" and "d: you can if you want but you don't have to". Then if you do a bunch of thinking you end up with something very much like what everybody else in these two threads have been talking about.

It isnt good idea to store such values in code. I think it is something that computed when a programming environment is booting up. E.g. when you run "python", or install "python".

I try to understand how Math.sin works. There is Math.cos. It is sin +90 degrees. So not all of them is something that completes a big puzzle.

There's no nice way of saying this, and I mean no malice here, but I think you're exceptionally confused or ignorant, and I don't think it would be rewarding for either of us to continue this conversation.
atan(x) = asin(x / sqrt(1+x*x))
So the asin is brute force. I think it is `atan` function. Written article explains nothing. Sqrt is also like that.

It looks like there is a reasonable explanation when it is written math form but there is no.