| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by dancsi 1586 days ago
	The implementation of the __cos kernel in Musl is actually quite elegant. After reducing the input to the range [-pi/4, pi/4], it just applies the best degree-14 polynomial for approximating the cosine on this interval. It turns out that this suffices for having an error that is less than the machine precision. The coefficients of this polynomial can be computed with the Remez algorithm, but even truncating the Chebyshev expansion is going to yield much better results than any of the methods proposed by the author.

6 comments

l- 1586 days ago

https://github.com/ifduyue/musl/blob/master/src/math/__cos.c

link

mnarayan01 1585 days ago

For anyone else confused by this, the control logic described in the comment happens in https://github.com/ifduyue/musl/blob/master/src/math/cos.c

link

llimllib 1586 days ago

> Input y is the tail of x.

What does "tail" mean in this context?

link

lifthrasiir 1586 days ago

It is a double-double representation [1], where a logical fp number is represented with a sum of two machine fp numbers x (head, larger) and y (tail, smaller). This effectively doubles the mantissa. In the context of musl this representation is produced from the range reduction process [2].

[1] https://en.wikipedia.org/wiki/Quadruple-precision_floating-p...

[2] https://github.com/ifduyue/musl/blob/6d8a515796270eb6cec8a27...

link

jhgb 1586 days ago

Does it make sense to use a double-double input when you only have double output? Sine is Lispchitz-limited by 1 so I don't see how this makes a meaningful difference.

link

lifthrasiir 1586 days ago

The input might be double but the constant pi is not. Let f64(x) be a function from any real number to double, so that an ordinary expression `a + b` actually computes f64(a + b) and so on. Then in general f64(sin(x)) may differ from f64(sin(f64(x mod 2pi))); since you can't directly compute f64(sin(x mod 2pi)), you necessarily need more precision during argument reduction so that f64(sin(x)) = f64(sin(f64timeswhatever(x mod 2pi))).

link

jhgb 1586 days ago

But am I correct in thinking that is at worst a 0.5 ulp error in this case? The lesser term in double-double can't be more than 0.5 ulp of the greater term and sensitivity of both sine and cosine to an error in the input will not be more than 1.

Also, in case of confusion, I was specifically commenting on the function over the [-pi/4, pi/4] domain in https://github.com/ifduyue/musl/blob/master/src/math/__cos.c , which the comment in https://news.ycombinator.com/item?id=30846546 was presumably about.

link

ufo 1586 days ago

I think the polynomial calculation in the end looks interesting. It doesn't use Horner's rule.

link

lifthrasiir 1585 days ago

It does use Horner's rule, but splits the expression into two halves in order to exploit instruction-level parallelism.

link

jacobolus 1585 days ago

Considering the form of both halves is the same, are compilers smart enough to vectorize this code?

link

adgjlsfhk1 1585 days ago

I might be wrong but I would think for something like this vectorizing wouldn't save time (since you would have to move data around before and afterwards. The real benefit of this is it lets you run two fma operations in parallel.

link

brandmeyer 1586 days ago

This has been the standard algorithm used by every libm for decades. Its not special to Musl.

link

enriquto 1586 days ago

But isn't this code rarely called in practice? I guess on intel architectures the compiler just calls the fsin instruction of the cpu.

link

stephencanon 1585 days ago

No. FSIN has accuracy issues as sibling mentions, but is also much slower than a good software implementation (it varies with uArch, but 1 result every ~100 cycles is common; even mediocre scalar software implementations can produce a result every twenty cycles).

link

adgjlsfhk1 1586 days ago

No. The fsin instruction is inaccurate enough to be useless. It gives 0 correct digits when the output is close to 0.

link

enriquto 1585 days ago

> 0 correct digits when the output is close to 0

this is an amusing way to describe the precision of sub-normal floating point numbers

link

adgjlsfhk1 1585 days ago

It's not just sub-normal numbers. As https://randomascii.wordpress.com/2014/10/09/intel-underesti... shows, fsin only uses 66 bits of pi, which means you have roughly no precision whenever abs(sin(x))<10^-16 which is way bigger than the biggest subnormal (10^-307 or so)

link

ramshorns 1585 days ago

In that range, just returning x would be way better. Maybe even perfect actually - if x is less than 10^-16, then the error of x^3/6 is less than the machine precision for x.

link

lifthrasiir 1585 days ago

It is much more amusing if you describe it in ulps; for some inputs the error can reach > 2^90 ulps, more than the mantissa size itself.

link

jhgb 1585 days ago

FSIN only works on x87 registers which you will rarely use on AMD64 systems -- you really want to use at least scalar SSE2 today (since that is whence you receive your inputs as per typical AMD64 calling conventions anyway). Moving data from SSE registers to the FP stack just to calculate FSIN and then moving it back to SSE will probably kill your performance even if your FSIN implementation is good. If you're vectorizing your computation over 4 double floats or 8 single floats in an AVX register, it gets even worse for FSIN.

link

stephencanon 1585 days ago

Moving between x87 and xmm registers is actually fairly cheap (it's through memory, so it's not free, but it's also not _that_ bad). FSIN itself is catastrophically slow.

link

jhgb 1585 days ago

Fair enough, and I imagine there may even be some forwarding going on? There often is when a load follows a store, if I remember correctly. (Of course this will be microarchitecture-dependent.)

link

chrisseaton 1585 days ago

> I guess on intel architectures the compiler just calls the fsin instruction of the cpu.

Do people do that in practice? It's on the FPU, which is basically legacy emulated these days, and it's inaccurate.

link

enriquto 1585 days ago

> the FPU, which is basically legacy

you'll pry my long doubles from my cold, dead hands!

link

jhgb 1585 days ago

The question is if you wouldn't be better served with double-doubles today. You get ~100 bits of mantissa AND you can still vectorize your computations.

link

enriquto 1585 days ago

Sure. There should be a gcc flag to make "long double" become quadruple precision.

The thing is, my first programming language was x86 assembler and the fpu was the funniest part. Spent weeks as a teenager writing almost pure 8087 code. I have a lot of emotional investment in that tiny rolling stack of extended precision floats.

link

chrisseaton 1585 days ago

You're selectively quoting me - I said it's 'legacy emulated'. It's emulated using very long-form microcode, so basically emulated in software. I didn't say it was simply 'legacy'.

link

eyelidlessness 1585 days ago

I’m completely out of my depth reading through these comments so I don’t have much of value to contribute, but I do think I can gently say I think the selective quoting was harmless, but deliberate to fit the shape of a harmless joke’s setup. I don’t think there was any intent to misrepresent your more salient information.

link

lifthrasiir 1585 days ago

x87 trig functions can be very inaccurate due to the Intel's original sloppy implementation and subsequent compatibility requirements [1].

[1] http://nighthacks.com/jag/blog/134/index.html

link

jcelerier 1585 days ago

Wasn't there some blog article a few years ago which showed how glibc's implementation was faster than fsin ?

link

ajross 1585 days ago

That's got nothing to do with Musl per se, that's just the SunPro code that basically every C library uses. I'm sure the polynomials themselves (there's another one for the "crest" of the sin curve, and still more for log and exp and the rest of the family) date from somewhere in computing pre-history. Optimizing polynomial fits to analytic functions was something you could do on extremely early computers.

link

jeremysalwen 1585 days ago

Thanks for explaining this. I actually wrote a SIMD implementation of trig functions years ago, using the techniques you describe.

You can check it out: https://github.com/jeremysalwen/vectrig

I compared several different methods of generating polynomials of different sizes for speed and precision (spoilers: taylor series were the worst and minimax polynomials (Remez algorithm) were the best).

Another (surprising) thing which I learned during the project was that the range reduction was just as (if not more) important to the accuracy of the implementation than the polynomial. If you think about it, you will realize that it's actually pretty difficult to quickly and accurately compute the sin of large numbers like 2^50.

I also tried to directly optimize the coefficients for the accuracy of the polynomial on the required range, but that experiment was unsuccessful.

It's all there in the repository, the implementations, notes about the different polynomials used, and the accuracy/speed statistics for the different methods.

link

jhgb 1585 days ago

> I compared several different methods of generating polynomials of different sizes for speed and precision (spoilers: taylor series were the worst and minimax polynomials (Remez algorithm) were the best).

I would have expected at least an LSQ approximation with a basis of Legendre polynomials thrown into the mix. I got that as a basic homework in my numerics class once, after we've shown to ourselves in the class that [1, x, x², x³...] is not a really good basis to project things onto.

link

bertr4nd 1585 days ago

Are there libraries/tools that people use to do Remez/Chebyshev/etc. function expansions? I can do a basic Taylor series expansion by hand but I’m out of my depth with more sophisticated techniques.

link

adgjlsfhk1 1584 days ago

sollya is the king here

link

bertr4nd 1584 days ago

Thanks!

link

toolslive 1585 days ago

Yup. Chebyshev is the way to go (after 30s of consideration)

link