by martinpw 3401 days ago
> I bet if you're using Intel MLK for example, it will probably "magically" get faster on these Skylake machines by using AVX512 automatically.

(Aside, I think you mean MKL.)

There is a 200MHz clock reduction when running AVX512 instructions. If your code makes heavy use of AVX512 there is of course still a big net win, but I'm curious about the impact on more heterogeneous workloads. We have an app that is a mixture of scalar and vector code. Some, but not all, of the vector code would benefit from 512-bit vectors. But how much does the clock slowdown when running this code bleed over into running the other non-AVX512 code? I guess I'm asking how quickly it clocks down, and how quickly the full clock speed is restored. Worst case, it seems you could be running full time at a 200MHz slowdown due to blocks of AVX512 instructions scattered throughout the application. Is that a valid concern?


Thanks for the correction. And FWIW this is an insanely good question and it's hard for me to immediately answer! The things I want AVX512 for are very fat SIMD registers for my cryptographic code, and I've only lightly kept up with it since AVX-512 was pushed off to Skylake-Xeon only. So I haven't worried about highly heterogeneous workloads (in my head).

I'd have to look up the specifics; but does AVX512 simply slow the clock, or does it actually have some kind of limited number of hardware ports? I wonder if some clock slowdown would be very much of an issue, since clock-for-clock, you should see better performance on Skylake anyway.

Just curious, what kind of workloads do you think you're looking at here?

I'm having trouble finding good references now, but I am sure I remember reading that it was simply slowing the clock.

In my case, yes, Skylake would still be a win over older hardware, but the question is whether to use AVX512 or not. The workload is a real-time animation system with a bunch of nodes in a graph that get evaluated in sequence. Some nodes would benefit from AVX512, but others would not. So the question is: if we vectorize those nodes that would benefit and get a speedup there, will the other unvectorized nodes now run slower as a result of the lower clock speed, canceling out the benefit?
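
A rough way to reason about that trade-off is a back-of-the-envelope model. The sketch below is only an illustration under loud assumptions: every node is compute-bound, so its time scales inversely with clock speed; the reduced clock applies to the whole graph (the worst case described above); and the 2.4GHz nominal clock, the per-node times, and the function names are all hypothetical. Only the 200MHz drop comes from the discussion.

```python
def graph_time(nodes, base_ghz, drop_ghz, use_avx512):
    """Modeled time to evaluate every node in the graph once.

    nodes: list of (baseline_seconds, avx512_speedup) pairs. A speedup
    of 1.0 means the node gains nothing from 512-bit vectors.
    """
    if not use_avx512:
        return sum(t for t, _ in nodes)
    # Worst-case assumption: the lower clock is in effect for all nodes,
    # including the ones that contain no AVX512 code.
    clock_penalty = base_ghz / (base_ghz - drop_ghz)
    return sum(t / s * clock_penalty for t, s in nodes)


def break_even_speedup(vector_fraction, base_ghz, drop_ghz):
    """Speedup the vectorizable fraction needs just to break even.

    Solves penalty * ((1 - f) + f / s) = 1 for s, where f is the share
    of baseline time spent in nodes that can use AVX512.
    """
    clock_penalty = base_ghz / (base_ghz - drop_ghz)
    remaining = 1.0 / clock_penalty - (1.0 - vector_fraction)
    if remaining <= 0:
        # The scalar nodes alone already exceed the original time budget.
        return float("inf")
    return vector_fraction / remaining


if __name__ == "__main__":
    # Hypothetical graph: one node that doubles in speed, one that is unchanged.
    nodes = [(1.0, 2.0), (1.0, 1.0)]
    print(graph_time(nodes, 2.4, 0.2, use_avx512=False))
    print(graph_time(nodes, 2.4, 0.2, use_avx512=True))
    print(break_even_speedup(0.5, 2.4, 0.2))
```

Under these assumptions, if half of the baseline time is in vectorizable nodes, they need roughly a 1.2x speedup to pay for the clock drop. If only a few percent of the time is vectorizable, no speedup can compensate, which is exactly the scenario the question is worried about. How close real hardware gets to this worst case depends on how quickly the clock recovers, which this model does not capture.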

It sounds like your case is a much better fit for AVX512. Out of curiosity, have you tried running on Xeon Phi, which also supports AVX512?