Hacker News
by aseipp 3401 days ago
Simple process improvements that make Skylake faster will just be "there" -- some code will just be faster with no changes!

For the numerical, AVX stuff -- it should be mostly automatic for you if you're already using optimized libraries at the core -- MLK, BLAS, that kind of stuff. They'll be transparently upgraded for you -- ideally -- to take advantage of these things. They normally check what your CPU is at runtime and pick the fastest implementation among the few choices they ship.
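To make the "check the CPU at runtime, pick the fastest implementation" idea concrete, here is a minimal sketch of a dispatch table. It is not how any particular library does it. The kernel names (`dot_generic`, `dot_avx2`, `dot_avx512`) and `select_kernel` are hypothetical, and the "fast" kernels are plain Python stand-ins for what would really be separately compiled routines.

```python
# Hypothetical kernels. In a real library each would be a separately compiled
# routine; here they are stand-ins that all compute the same dot product.
def dot_generic(a, b):
    return sum(x * y for x, y in zip(a, b))


def dot_avx2(a, b):
    return dot_generic(a, b)  # stand-in for an AVX2 build


def dot_avx512(a, b):
    return dot_generic(a, b)  # stand-in for an AVX-512 build


# Ordered from most to least preferred. None means "no special requirement".
CANDIDATES = [
    ("avx512f", dot_avx512),
    ("avx2", dot_avx2),
    (None, dot_generic),
]


def select_kernel(cpu_flags):
    """Return the first kernel whose required CPU flag is available."""
    for required, kernel in CANDIDATES:
        if required is None or required in cpu_flags:
            return kernel
    raise RuntimeError("no usable kernel")


# The selection happens once, at startup; callers only ever see `dot`.
dot = select_kernel({"sse2", "avx", "avx2"})
```

The caller never changes: swapping in a machine that reports `avx512f` changes which kernel `select_kernel` returns, which is why code built on such libraries can get faster without being touched.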

You will need a toolchain that supports all this, but for the most part that likely won't be a burden unless you want to get your hands dirty and do it yourself -- inevitably, this should all mostly be "pre-canned". Your optimized linear algebra, vector, and math libraries are what will mostly concern themselves with this, not you necessarily. In fact, several of the things already available can probably use these new extensions! I bet if you're using Intel MLK for example, it will probably "magically" get faster on these Skylake machines by using AVX512 automatically.

If you want to understand more: you can always go grab an SSE/AVX reference, check your /proc/cpuinfo, and write a few simple things on your own to get a feel. Your toolchain will definitely support it :)
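As a starting point for the /proc/cpuinfo check, a small sketch that reads the `flags` line and reports a few SIMD extensions. It assumes an x86 Linux machine; the flag names (`sse2`, `sse4_2`, `avx`, `avx2`, `avx512f`) are the ones the Linux kernel reports, while the helper names `parse_cpu_flags` and `simd_support` are made up for this example.

```python
from pathlib import Path

# Flag names as they appear in /proc/cpuinfo on x86 Linux.
SIMD_FLAGS = ("sse2", "sse4_2", "avx", "avx2", "avx512f")


def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line, as a set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. not x86, or empty input)


def simd_support(cpuinfo_text):
    """Map each SIMD flag of interest to True/False."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in SIMD_FLAGS}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    for name, present in simd_support(text).items():
        print(f"{name:8s} {'yes' if present else 'no'}")
```

Only the first `flags` line is read, on the assumption that all cores report the same feature set.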

> I bet if you're using Intel MLK for example, it will probably "magically" get faster on these Skylake machines by using AVX512 automatically.

(Aside, I think you mean MKL.)

There is a 200MHz clock reduction when running AVX512 instructions. If your code makes heavy use of AVX512 there is of course still a big net win, but I'm curious about the impact on more heterogeneous workloads. We have an app that is a mixture of scalar and vector code. Some, but not all, of the vector code would benefit from 512 bit vectors. But how much does the clock slowdown when running this code bleed over into running the other non-AVX512 code? I guess I'm asking how quickly it clocks down, and how quickly the full clock speed is restored. Worst case it seems you could be running full time at a 200MHz slowdown due to blocks of AVX512 instructions scattered throughout the application. Is that a valid concern?

Thanks for the correction. And FWIW this is an insanely good question and it's hard for me to immediately answer! The things I want AVX512 for are very fat SIMD registers for my cryptographic code, and I've only lightly kept up with it since AVX512 was pushed off to Skylake-Xeon only. So I haven't worried about highly heterogeneous workloads (in my head).

I'd have to look up the specifics; but does AVX512 simply slow the clock, or does it actually have some kind of limited number of hardware ports? I wonder if some clock slowdown would be very much of an issue, since clock-for-clock, you should see better performance on Skylake anyway.

Just curious, what kind of workloads do you think you're looking at here?

I'm having trouble finding good references now, but I am sure I remember reading that it was simply slowing the clock.

In my case, yes, Skylake would still be a win over older hardware, but the question is whether to use AVX512 or not. The workload is a real-time animation system with a bunch of nodes in a graph that get evaluated in sequence. Some nodes would benefit from AVX512, but others would not. So the question is, if we vectorize those nodes that would benefit and get a speedup there, will the other unvectorized nodes now run slower as a result of the lower clock speed, canceling out the benefit?
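A back-of-the-envelope way to frame that trade-off, as a rough sketch and not a measurement. It assumes the worst case described above (the reduced clock applies to the whole run), that runtime scales inversely with clock speed, and that a known fraction of the baseline runtime sits in nodes that speed up by a known factor. The function names and the 3000MHz base clock in the usage lines are hypothetical; only the 200MHz drop comes from this thread.

```python
def relative_runtime(vector_fraction, vector_speedup, base_mhz, drop_mhz):
    """Runtime with AVX512 relative to the baseline (below 1.0 is a net win).

    vector_fraction: share of baseline runtime spent in nodes that benefit.
    vector_speedup:  speedup of those nodes at equal clock speed.
    Worst case: the whole run executes at the reduced clock.
    """
    clock_ratio = (base_mhz - drop_mhz) / base_mhz
    work = vector_fraction / vector_speedup + (1.0 - vector_fraction)
    return work / clock_ratio


def break_even_speedup(vector_fraction, base_mhz, drop_mhz):
    """Node speedup needed to merely match the baseline, or None if impossible."""
    clock_loss = drop_mhz / base_mhz
    if vector_fraction <= clock_loss:
        return None  # too little vector work to ever pay for the slower clock
    return vector_fraction / (vector_fraction - clock_loss)


# Hypothetical numbers: 30% of the frame in vectorizable nodes, 3000MHz base.
needed = break_even_speedup(0.30, base_mhz=3000, drop_mhz=200)
ratio = relative_runtime(0.30, 1.8, base_mhz=3000, drop_mhz=200)
print(f"break-even node speedup: {needed:.2f}x, relative runtime at 1.8x: {ratio:.3f}")
```

Under these assumptions the answer hinges on how much of the graph actually vectorizes: when the vectorizable share is smaller than the fractional clock loss, no speedup on those nodes can recover the cost.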

It sounds like your case is a much better fit for AVX512. Out of curiosity, have you tried running on Xeon Phi, which also supports AVX512?