Hacker News new | ask | show | jobs
by dancsi 1540 days ago
The implementation of the __cos kernel in Musl is actually quite elegant. After reducing the input to the range [-pi/4, pi/4], it just applies the best degree-14 polynomial for approximating the cosine on this interval. It turns out that this suffices for having an error that is less than the machine precision. The coefficients of this polynomial can be computed with the Remez algorithm, but even truncating the Chebyshev expansion is going to yield much better results than any of the methods proposed by the author.
6 comments

For anyone else confused by this, the control logic described in the comment happens in https://github.com/ifduyue/musl/blob/master/src/math/cos.c
> Input y is the tail of x.

What does "tail" mean in this context?

It is a double-double representation [1], where a logical fp number is represented with a sum of two machine fp numbers x (head, larger) and y (tail, smaller). This effectively doubles the mantissa. In the context of musl this representation is produced from the range reduction process [2].

[1] https://en.wikipedia.org/wiki/Quadruple-precision_floating-p...

[2] https://github.com/ifduyue/musl/blob/6d8a515796270eb6cec8a27...

Does it make sense to use a double-double input when you only have double output? Sine is Lispchitz-limited by 1 so I don't see how this makes a meaningful difference.
The input might be double but the constant pi is not. Let f64(x) be a function from any real number to double, so that an ordinary expression `a + b` actually computes f64(a + b) and so on. Then in general f64(sin(x)) may differ from f64(sin(f64(x mod 2pi))); since you can't directly compute f64(sin(x mod 2pi)), you necessarily need more precision during argument reduction so that f64(sin(x)) = f64(sin(f64timeswhatever(x mod 2pi))).
But am I correct in thinking that is at worst a 0.5 ulp error in this case? The lesser term in double-double can't be more than 0.5 ulp of the greater term and sensitivity of both sine and cosine to an error in the input will not be more than 1.

Also, in case of confusion, I was specifically commenting on the function over the [-pi/4, pi/4] domain in https://github.com/ifduyue/musl/blob/master/src/math/__cos.c , which the comment in https://news.ycombinator.com/item?id=30846546 was presumably about.

I think the polynomial calculation in the end looks interesting. It doesn't use Horner's rule.
It does use Horner's rule, but splits the expression into two halves in order to exploit instruction-level parallelism.
Considering the form of both halves is the same, are compilers smart enough to vectorize this code?
I might be wrong but I would think for something like this vectorizing wouldn't save time (since you would have to move data around before and afterwards. The real benefit of this is it lets you run two fma operations in parallel.
This has been the standard algorithm used by every libm for decades. Its not special to Musl.
But isn't this code rarely called in practice? I guess on intel architectures the compiler just calls the fsin instruction of the cpu.
No. FSIN has accuracy issues as sibling mentions, but is also much slower than a good software implementation (it varies with uArch, but 1 result every ~100 cycles is common; even mediocre scalar software implementations can produce a result every twenty cycles).
No. The fsin instruction is inaccurate enough to be useless. It gives 0 correct digits when the output is close to 0.
> 0 correct digits when the output is close to 0

this is an amusing way to describe the precision of sub-normal floating point numbers

It's not just sub-normal numbers. As https://randomascii.wordpress.com/2014/10/09/intel-underesti... shows, fsin only uses 66 bits of pi, which means you have roughly no precision whenever abs(sin(x))<10^-16 which is way bigger than the biggest subnormal (10^-307 or so)
In that range, just returning x would be way better. Maybe even perfect actually - if x is less than 10^-16, then the error of x^3/6 is less than the machine precision for x.
It is much more amusing if you describe it in ulps; for some inputs the error can reach > 2^90 ulps, more than the mantissa size itself.
FSIN only works on x87 registers which you will rarely use on AMD64 systems -- you really want to use at least scalar SSE2 today (since that is whence you receive your inputs as per typical AMD64 calling conventions anyway). Moving data from SSE registers to the FP stack just to calculate FSIN and then moving it back to SSE will probably kill your performance even if your FSIN implementation is good. If you're vectorizing your computation over 4 double floats or 8 single floats in an AVX register, it gets even worse for FSIN.
Moving between x87 and xmm registers is actually fairly cheap (it's through memory, so it's not free, but it's also not _that_ bad). FSIN itself is catastrophically slow.
Fair enough, and I imagine there may even be some forwarding going on? There often is when a load follows a store, if I remember correctly. (Of course this will be microarchitecture-dependent.)
> I guess on intel architectures the compiler just calls the fsin instruction of the cpu.

Do people do that in practice? It's on the FPU, which is basically legacy emulated these days, and it's inaccurate.

> the FPU, which is basically legacy

you'll pry my long doubles from my cold, dead hands!

The question is if you wouldn't be better served with double-doubles today. You get ~100 bits of mantissa AND you can still vectorize your computations.
Sure. There should be a gcc flag to make "long double" become quadruple precision.

The thing is, my first programming language was x86 assembler and the fpu was the funniest part. Spent weeks as a teenager writing almost pure 8087 code. I have a lot of emotional investment in that tiny rolling stack of extended precision floats.

You're selectively quoting me - I said it's 'legacy emulated'. It's emulated using very long-form microcode, so basically emulated in software. I didn't say it was simply 'legacy'.
I’m completely out of my depth reading through these comments so I don’t have much of value to contribute, but I do think I can gently say I think the selective quoting was harmless, but deliberate to fit the shape of a harmless joke’s setup. I don’t think there was any intent to misrepresent your more salient information.
x87 trig functions can be very inaccurate due to the Intel's original sloppy implementation and subsequent compatibility requirements [1].

[1] http://nighthacks.com/jag/blog/134/index.html

Wasn't there some blog article a few years ago which showed how glibc's implementation was faster than fsin ?
That's got nothing to do with Musl per se, that's just the SunPro code that basically every C library uses. I'm sure the polynomials themselves (there's another one for the "crest" of the sin curve, and still more for log and exp and the rest of the family) date from somewhere in computing pre-history. Optimizing polynomial fits to analytic functions was something you could do on extremely early computers.
Thanks for explaining this. I actually wrote a SIMD implementation of trig functions years ago, using the techniques you describe.

You can check it out: https://github.com/jeremysalwen/vectrig

I compared several different methods of generating polynomials of different sizes for speed and precision (spoilers: taylor series were the worst and minimax polynomials (Remez algorithm) were the best).

Another (surprising) thing which I learned during the project was that the range reduction was just as (if not more) important to the accuracy of the implementation than the polynomial. If you think about it, you will realize that it's actually pretty difficult to quickly and accurately compute the sin of large numbers like 2^50.

I also tried to directly optimize the coefficients for the accuracy of the polynomial on the required range, but that experiment was unsuccessful.

It's all there in the repository, the implementations, notes about the different polynomials used, and the accuracy/speed statistics for the different methods.

> I compared several different methods of generating polynomials of different sizes for speed and precision (spoilers: taylor series were the worst and minimax polynomials (Remez algorithm) were the best).

I would have expected at least an LSQ approximation with a basis of Legendre polynomials thrown into the mix. I got that as a basic homework in my numerics class once, after we've shown to ourselves in the class that [1, x, x², x³...] is not a really good basis to project things onto.

Are there libraries/tools that people use to do Remez/Chebyshev/etc. function expansions? I can do a basic Taylor series expansion by hand but I’m out of my depth with more sophisticated techniques.
sollya is the king here
Thanks!
Yup. Chebyshev is the way to go (after 30s of consideration)