| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by vardump 875 days ago
	CORDIC is how it's usually done in hardware (and FPGAs). https://en.wikipedia.org/wiki/CORDIC

3 comments

perihelions 875 days ago

CORDIC is pretty obsolete, AFAIK. Its advantage is that its hardware requirements are absolutely tiny: two (?) accumulator registers, and hardware adders and shift-ers—I think that's all. No multiplication needed, in particular. Very convenient if you're building things from discrete transistors, like the some of those earlier scientific calculators!

(Also has a nice property, apparently, that CORDIC-like routines exist for a bunch of special functions and they're very similar to each other. Does anyone have a good resource for learning the details of those algorithms? They sound elegant).

pclmulqdq 875 days ago

CORDIC still is used in tiny microcontrollers (smaller than Cortex-M0) and in FPGAs when you are very resource-constrained. Restricting the domain and using Chebyshev/Remez is the way to go pretty much everywhere.

idatum 875 days ago

I just recently came across CORDIC in a fun read I just finished, 'Empire of the Sum', by Keith Houston. It's a history of calculators, and there's a chapter on the HP-35 which used CORDIC. The author goes into some details how it was used by that constrained device.

There are a lot of references given in the book including more details on CORDIC.

rcxdude 875 days ago

CORDIC doesn't make sense if multiplies have a similar cost to adds. If you have cheap multiplications then a successive approximation is faster. If multiplications are expensive relative to adds then CORDIC can still generally win. Basically this is only the case nowadays if you are doing ASICs or FPGAs.

derf_ 875 days ago

I have used CORDIC to compute fixed-point log/exp with 64 bits of precision, mostly because I wanted to avoid having to implement (portable) 64x64->128 bit integer multiplication in C. It is probably still slower than doing that, but the code was really simple, and very accurate.

dilyevsky 875 days ago

I think still used out of necessity when hw floating point not available (like fpgas)

f1shy 875 days ago

Multiplication is pretty much needed in cordic! And is far from obsolete! It works perfectly fine, and dont have any of the problems said in the article.

kevin_thibedeau 875 days ago

CORDIC doesn't use multipliers. That's the whole appeal for low performance hardware since it's all shifts and adds. It can still be useful on more capable platforms when you want sin and cos in one operation since there is no extra cost.

perihelions 875 days ago

It uses multiplication by powers of two, which is a floating-point bit shift.

adgjlsfhk1 875 days ago

no it's not. cordic has awful convergence of 1 bit per iteration. pretty much everyone uses power series.

ajross 875 days ago

Actually pretty much everyone implements double precision sin/cos using the same (IIRC) pair of 6th order polynomials. The same SunPro code exists unchnaged in essentially every C library everywehre. It's just a fitted curve, no fancy series definition beyond what appears in the output coefficients. One for the "mostly linear" segment where the line crosses the origin and another for the "mostly parabolic" peak of the curve.

f1shy 875 days ago

That is 64 iterations for a double, that is nothing!

adgjlsfhk1 875 days ago

53, but that's still a lot more than the 5th degree polynomial that you need.

rcxdude 875 days ago

yeah, but 52 adds can be a lot cheaper than a few multiplies, if you're making them out of shift registers and logic gates (or LUT). in a CPU or GPU, who cares, moving around the data is 100x more expensive than the ALU operation.

Const-me 875 days ago

> in a CPU or GPU, who cares, moving around the data is 100x more expensive than the ALU operation

Moving data is indeed expensive, but there’s another reason to not care. Modern CPUs take same time to add or multiply floats.

For example, the computer I’m using, with AMD Zen3 CPU cores, takes 3 cycles to add or multiply numbers, which applies to both 32- and 64-bit flavors of floats. See addps, mulps, addpd, mulpd SSE instructions in that table: https://www.uops.info/table.html

adgjlsfhk1 875 days ago

> moving around the data is 100x more expensive than the ALU operation.

This is exactly the problem with CORDIC. 52 dependent adds requires moving data from a register to the ALU and back 52 times.

rcxdude 875 days ago

It's the problem with CORDIC in that context, yes!

bigbillheck 875 days ago

Why would you ever use CORDIC if you had any other option?

vardump 875 days ago

It's great for hardware implementations, because it's simple and you get good/excellent accuracy. I wouldn't be surprised if that's still how modern x86-64 CPUs compute sin, cos, etc.

That said, last time I had to do that in software, I used Taylor series. Might not have been an optimal solution.

EDIT:

AMD's Zen 4 takes 50-200 cycles (latency) to compute sine. I think that strongly suggests AMD uses CORDIC. https://www.agner.org/optimize/instruction_tables.pdf page 130.

Same for Intel, Tiger Lake (Intel gen 11) has 60-120 cycles of latency. Page 353.

I'd guess usually ~50 cycles for Zen 4 (and ~60 for Intel) for float32, float64/float80 datatype. Denormals might also cost more cycles.

bigbillheck 875 days ago

They switched away from CORDIC at one point: https://www.intel.com/content/www/us/en/developer/articles/t...

(there doesn't seem to actually be a linked article there, just the summary)

vardump 875 days ago

Pretty weird Intel's sine computation latency hasn't changed all that much over the years. Latencies have been pretty similar for 20 years.

EDIT: That's a paper for a software library, not the CPU's internal implementation. Which is probably still done with CORDIC.

bigbillheck 875 days ago

> EDIT: That's a paper for a software library, not the CPU's internal implementation.

Unless you're seeing something I'm not, it's talking about x87, which hasn't been anything other than 'internal' since they stopped selling the 80486sx.

vardump 875 days ago

Ah you're right.

Anyways I wonder why it's still so slow.

60-120 cycles sure looks like a CORDIC implementation, but perhaps not.