Hacker News new | ask | show | jobs
by vardump 828 days ago
CORDIC is how it's usually done in hardware (and FPGAs).

https://en.wikipedia.org/wiki/CORDIC

3 comments

CORDIC is pretty obsolete, AFAIK. Its advantage is that its hardware requirements are absolutely tiny: two (?) accumulator registers, and hardware adders and shift-ers—I think that's all. No multiplication needed, in particular. Very convenient if you're building things from discrete transistors, like the some of those earlier scientific calculators!

(Also has a nice property, apparently, that CORDIC-like routines exist for a bunch of special functions and they're very similar to each other. Does anyone have a good resource for learning the details of those algorithms? They sound elegant).

CORDIC still is used in tiny microcontrollers (smaller than Cortex-M0) and in FPGAs when you are very resource-constrained. Restricting the domain and using Chebyshev/Remez is the way to go pretty much everywhere.
I just recently came across CORDIC in a fun read I just finished, 'Empire of the Sum', by Keith Houston. It's a history of calculators, and there's a chapter on the HP-35 which used CORDIC. The author goes into some details how it was used by that constrained device.

There are a lot of references given in the book including more details on CORDIC.

CORDIC doesn't make sense if multiplies have a similar cost to adds. If you have cheap multiplications then a successive approximation is faster. If multiplications are expensive relative to adds then CORDIC can still generally win. Basically this is only the case nowadays if you are doing ASICs or FPGAs.
I have used CORDIC to compute fixed-point log/exp with 64 bits of precision, mostly because I wanted to avoid having to implement (portable) 64x64->128 bit integer multiplication in C. It is probably still slower than doing that, but the code was really simple, and very accurate.
I think still used out of necessity when hw floating point not available (like fpgas)
Multiplication is pretty much needed in cordic! And is far from obsolete! It works perfectly fine, and dont have any of the problems said in the article.
CORDIC doesn't use multipliers. That's the whole appeal for low performance hardware since it's all shifts and adds. It can still be useful on more capable platforms when you want sin and cos in one operation since there is no extra cost.
It uses multiplication by powers of two, which is a floating-point bit shift.
no it's not. cordic has awful convergence of 1 bit per iteration. pretty much everyone uses power series.
Actually pretty much everyone implements double precision sin/cos using the same (IIRC) pair of 6th order polynomials. The same SunPro code exists unchnaged in essentially every C library everywehre. It's just a fitted curve, no fancy series definition beyond what appears in the output coefficients. One for the "mostly linear" segment where the line crosses the origin and another for the "mostly parabolic" peak of the curve.
That is 64 iterations for a double, that is nothing!
53, but that's still a lot more than the 5th degree polynomial that you need.
yeah, but 52 adds can be a lot cheaper than a few multiplies, if you're making them out of shift registers and logic gates (or LUT). in a CPU or GPU, who cares, moving around the data is 100x more expensive than the ALU operation.
> in a CPU or GPU, who cares, moving around the data is 100x more expensive than the ALU operation

Moving data is indeed expensive, but there’s another reason to not care. Modern CPUs take same time to add or multiply floats.

For example, the computer I’m using, with AMD Zen3 CPU cores, takes 3 cycles to add or multiply numbers, which applies to both 32- and 64-bit flavors of floats. See addps, mulps, addpd, mulpd SSE instructions in that table: https://www.uops.info/table.html

> moving around the data is 100x more expensive than the ALU operation.

This is exactly the problem with CORDIC. 52 dependent adds requires moving data from a register to the ALU and back 52 times.

It's the problem with CORDIC in that context, yes!
Why would you ever use CORDIC if you had any other option?
It's great for hardware implementations, because it's simple and you get good/excellent accuracy. I wouldn't be surprised if that's still how modern x86-64 CPUs compute sin, cos, etc.

That said, last time I had to do that in software, I used Taylor series. Might not have been an optimal solution.

EDIT:

AMD's Zen 4 takes 50-200 cycles (latency) to compute sine. I think that strongly suggests AMD uses CORDIC. https://www.agner.org/optimize/instruction_tables.pdf page 130.

Same for Intel, Tiger Lake (Intel gen 11) has 60-120 cycles of latency. Page 353.

I'd guess usually ~50 cycles for Zen 4 (and ~60 for Intel) for float32, float64/float80 datatype. Denormals might also cost more cycles.

They switched away from CORDIC at one point: https://www.intel.com/content/www/us/en/developer/articles/t...

(there doesn't seem to actually be a linked article there, just the summary)

Pretty weird Intel's sine computation latency hasn't changed all that much over the years. Latencies have been pretty similar for 20 years.

EDIT: That's a paper for a software library, not the CPU's internal implementation. Which is probably still done with CORDIC.

> EDIT: That's a paper for a software library, not the CPU's internal implementation.

Unless you're seeing something I'm not, it's talking about x87, which hasn't been anything other than 'internal' since they stopped selling the 80486sx.

Ah you're right.

Anyways I wonder why it's still so slow.

60-120 cycles sure looks like a CORDIC implementation, but perhaps not.