by goosehonk 2389 days ago
Figuring out which instructions to use is a real problem, but it's a lot more nuanced than you seem to imply. For one, you seem to assume that "not running at the max turbo speed printed in the marketing literature" is the same thing as "downclocking". However, there are a huge number of reasons why a core might not clock up, including the number of active cores on the package. The first Xeons that shipped with AVX-512 had AVX-512 turbo clocks 25% lower than the headline turbo clocks, e.g. 2400MHz instead of 3200MHz. That is still pretty good, and the base clock is 2100MHz.
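To make the arithmetic in those figures explicit, here is a tiny sketch using only the example clocks quoted above (they are illustrative numbers for one early AVX-512 Xeon, not a spec for any particular SKU):

```python
# Example clocks quoted above for an early AVX-512 Xeon, in MHz.
headline_turbo = 3200  # turbo clock printed in the marketing literature
avx512_turbo = 2400    # turbo clock while running heavy AVX-512 code
base = 2100            # base clock

# Fractional reduction relative to the headline turbo clock.
reduction = 1 - avx512_turbo / headline_turbo
# How far the AVX-512 turbo clock still sits above the base clock.
above_base = avx512_turbo / base - 1

print(f"AVX-512 turbo is {reduction:.0%} below the headline turbo")
print(f"and {above_base:.0%} above the base clock")
```

So the AVX-512 turbo clock is well short of the headline number but still comfortably above base, which is the sense in which "not at max turbo" differs from "downclocked below spec".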

With the newest Ice Lake processors ("10th generation"), the all-cores-active, all-AVX-512 max clock speed is the same as the max scalar clock speed. You can try this out yourself with the avx-turbo program.


No: when we used MKL, the workload was slower, and turning MKL off made it faster. The marketing is irrelevant; using vectorized instructions slowed the workload down in practice, which is all that really matters. The Intel teams we were working with attributed it to the lower clock speeds caused by vectorized instructions. I don't really know, but it seems fair to assume that they do.

It will be interesting to test Ice Lake when it makes it to the cloud, hopefully sometime late next year, but until we can actually use Ice Lake, Skylake is what AVX-512 will be judged on.

It's a good thing you measured it :-) Programs that do a little bit of 512-bit FMA mixed in with other stuff will not benefit from AVX-512, but can suffer from the heat it generates, or from the hiccup when the CPU turns the FMA unit on and back off.

Codes that can do a lot of 512-bit FMA back to back will benefit greatly, and pay a small penalty (up to 25%) in throughput for everything else.
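A crude back-of-the-envelope model shows why "a little bit of FMA" can lose while "a lot of FMA" wins. This is a hypothetical sketch, not how the CPU actually manages clocks: it assumes a fraction `f` of the baseline runtime is FMA-heavy code that wider vectors speed up by a factor `s`, and pessimistically applies the clock ratio `c` (e.g. 0.75 for the 25% penalty) to the whole run. It ignores the power-up hiccup and thermal effects mentioned above. The names `estimated_speedup` and `break_even_fraction` are made up for illustration.

```python
def estimated_speedup(f: float, s: float, c: float) -> float:
    """Whole-program speedup under a simple, hypothetical model.

    f: fraction of baseline runtime spent in FMA-heavy code AVX-512 can accelerate
    s: speedup of that portion from wider vectors, at equal clock speed
    c: clock ratio while AVX-512 is active (0.75 means a 25% penalty),
       pessimistically assumed to apply to the entire run
    """
    # Accelerated part shrinks by s, the rest is unchanged, then everything
    # is stretched by the lower clock.
    new_runtime = (f / s + (1.0 - f)) / c
    return 1.0 / new_runtime


def break_even_fraction(s: float, c: float) -> float:
    """Smallest f for which estimated_speedup(f, s, c) reaches 1."""
    # Solve c = f / s + 1 - f for f.
    return (1.0 - c) / (1.0 - 1.0 / s)


if __name__ == "__main__":
    s, c = 2.0, 0.75  # assumed: 2x from 512-bit vs 256-bit, 25% clock penalty
    for f in (0.1, 0.5, 0.9):
        print(f"f={f:.1f}: speedup {estimated_speedup(f, s, c):.2f}x")
    print(f"break-even at f={break_even_fraction(s, c):.2f}")
```

With these assumed numbers the break-even point is f = 0.5: unless at least half the runtime is vectorizable FMA work, the model predicts a net slowdown, which is consistent with a workload like the MKL case getting slower. Real chips only drop the clock while heavy AVX-512 is active, so the actual penalty is usually smaller than this worst case.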

Codes that use the non-multiplier instructions that are merely marketed under the AVX-512 name, like VBMI2, also benefit greatly, and without any penalty.

People with AMD CPUs don't get a choice. It's hard to see how this belongs in Intel's mistakes column.

It's not really an Intel mistake, but it is an Intel problem. In ML, the ASICs are coming. NVIDIA is pretty much guaranteed to keep a leadership position in this space because its software layers are dominant. Intel's ML leadership position is quite tenuous because the killer ML features don't work well enough to justify the premium. MKL should be a solid moat, similar to NVIDIA's CUDA and cuDNN, but if getting the benefits requires serious effort, it becomes more palatable to spend that effort on ARM-based servers or custom hardware like Inferentia, which are meaningfully cheaper. Maybe Ice Lake will fix this, but Intel is running out of time to convince people that Intel chips should remain the first choice in ML.

AMD isn't relevant in this space AFAIK.