by solidasparagus 2386 days ago
Uhh source on AVX512 not downclocking on modern CPUs? We benchmarked ML workloads on the newest chips the cloud had to offer and the slowdown was a significant problem because, as the parent comment said, it is very hard to reason about whether the benefits of vectorized ops will outweigh the reduced clock speed. Sometimes they do and sometimes they do not - which is a major problem when you have to specify the instruction set when you build the ML library from source.

Maybe you know something I don’t but that FPGA statement makes zero sense to me. The ASIC development cycle is measured in years - that’s why FPGAs are valuable (and I thought they were relatively heavily used).

It is a major problem to figure out what instructions to use but it's a lot more nuanced than you seem to imply. In the first place you seem to assume that "not running at the max turbo speed that's printed in the marketing literature" is equivalent to "downclocking". However, there are a huge number of reasons why a core might not clock up, including the number of active cores on the package. The first Xeons that shipped with AVX-512 had turbo clocks that were 25% lower than the headline turbo clocks, e.g. 2400MHz instead of 3200MHz. This is still pretty good, and the base clock is 2100MHz.
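As a quick sanity check on those figures, the 25% falls straight out of the two clock speeds quoted. A minimal sketch in Python; the function name is made up for illustration and the numbers are just the ones from the paragraph above:

```python
def clock_reduction(headline_mhz: float, actual_mhz: float) -> float:
    """Fractional drop of actual_mhz relative to headline_mhz."""
    return (headline_mhz - actual_mhz) / headline_mhz


# Figures quoted in the comment above (not measured here).
headline_turbo_mhz = 3200  # headline turbo clock
avx512_turbo_mhz = 2400    # turbo clock with AVX-512 active
base_clock_mhz = 2100      # base clock

drop = clock_reduction(headline_turbo_mhz, avx512_turbo_mhz)
headroom = avx512_turbo_mhz / base_clock_mhz - 1

print(f"AVX-512 turbo is {drop:.0%} below the headline turbo")
print(f"AVX-512 turbo is still {headroom:.1%} above the base clock")
```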

With the newest Ice Lake processors ("10th generation") the all-cores-active, all-avx-512 max clock speeds are the same as max scalar clock speeds. You can try this out yourself with the avx-turbo program.

No, when we used MKL, the workload was slower and turning off MKL made the workload faster. The marketing is irrelevant - using vectorized instructions slowed down the workload in practice which is all that really matters. The Intel teams we were working with explained it as being due to the slower clock speeds caused by vectorized instructions. I don't really know, but it seems fair to assume that they do.

It will be interesting to test Ice Lake when it makes it to the cloud, hopefully some time late next year, but until we can actually use Ice Lake, Skylake is what AVX512 will be judged on.

It's a good thing you measured it :-) Programs that do a little bit of 512x512 FMA mixed in with other stuff will not benefit from AVX-512 but can suffer from the heat it generates, or from the hiccup when the CPU turns the FMA unit on and back off.

Codes that can do a lot of 512b FMA consecutively will benefit very greatly, and pay a small penalty (up to 25%) in terms of throughput for everything else.
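The trade-off in the last two paragraphs can be put into a toy model. This is only a sketch under loud assumptions: the clock penalty is applied to the whole run, the vectorized part gets a fixed per-clock gain, and the on/off hiccup, heat and memory bandwidth are ignored. The function names and the 2x gain are hypothetical; only the 25% penalty comes from the thread.

```python
def net_speedup(vector_fraction: float, vector_gain: float, clock_ratio: float) -> float:
    """Speedup over a scalar build running at full clock.

    vector_fraction: share of scalar runtime that can be vectorized (0..1)
    vector_gain:     assumed per-clock speedup of the vectorized part (> 1)
    clock_ratio:     reduced clock / full clock, e.g. 0.75 for a 25% penalty
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_gain <= 1.0 or not 0.0 < clock_ratio <= 1.0:
        raise ValueError("need vector_gain > 1 and 0 < clock_ratio <= 1")
    # Work measured in cycles shrinks only for the vectorized part,
    # but every cycle takes longer because of the lower clock.
    relative_cycles = (1.0 - vector_fraction) + vector_fraction / vector_gain
    relative_time = relative_cycles / clock_ratio
    return 1.0 / relative_time


def break_even_fraction(vector_gain: float, clock_ratio: float) -> float:
    """Vectorizable share of runtime at which net_speedup is exactly 1."""
    return (1.0 - clock_ratio) / (1.0 - 1.0 / vector_gain)


# Assumed numbers: 2x per-clock gain (hypothetical), 25% clock penalty.
gain, ratio = 2.0, 0.75
for fraction in (0.1, 0.5, 0.9):
    print(f"{fraction:.0%} vectorizable -> {net_speedup(fraction, gain, ratio):.2f}x")
print(f"break-even at {break_even_fraction(gain, ratio):.0%} vectorizable")
```

With those assumed numbers the model breaks even when half of the scalar runtime is vectorizable, which is one way to see why a little FMA mixed in with other stuff loses while long runs of it win. Real workloads should still be measured.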

Codes that use non-multiplier stuff that's just marketed as AVX-512, like VBMI2, also benefit greatly and without any penalty.

People with AMD CPUs don't get a choice. Hard to see how this accrues to Intel's mistakes column.

It's not really an Intel mistake, but it is an Intel problem. In ML, the ASICs are coming. NVIDIA is pretty much guaranteed to maintain a leadership position in this space because their software layers are dominant. Intel's ML leadership position is quite tenuous because the killer ML features don't work quite well enough for the premium. MKL should be a solid moat, similar to NVIDIA's CUDA and cuDNN, but if it requires serious effort to get the benefits, it becomes more palatable to spend that effort on ARM-based servers or custom hardware like Inferentia, which are meaningfully cheaper. Maybe Ice Lake will fix this, but Intel is running out of time to convince people that Intel chips should remain the first choice in ML.

AMD isn't relevant in this space AFAIK.

djb’s article linked above answers this in detail. In short, vectorized instructions don’t have to hit RAM to do things like addition, but they can heat up the processor faster than it can be cooled back to operating temperature. The clock speed drop is necessary to avoid overheating.

For more reading, check out CPU pipelining, as well as how vectorized instructions actually work. The performance benefit of well-implemented vectorized instructions overcomes the clock speed hit by leaps and bounds, which is why mobile systems make such heavy use of them, for example.

FPGAs are in a tough place. Like OP said, most people writing RTL make ASICs, or at least an ASIC that's programmable. The FPGA target market is getting slimmer, since we have programmable ASICs, like GPUs and TPUs, that are as performant with easier programming. They will still serve a niche market, but the "write C++ and run on an FPGA" approach will likely never take off.

I thought the main market for FPGAs was that period between "we have a problem that needs custom hardware" and "we have custom hardware being produced at the scale we need". I guess that's a relatively niche market?

There's also the market for high performance things that need to be in-service upgradeable. I understand mobile base stations are a significant customer for large FPGAs, to enable deployment of new standards revisions/modulation schemes without a truck roll.

Pretty much. The problem set they're useful for is low-latency high-throughput stuff, and/or connectivity to high speed digital signals, for things where there isn't an existing custom solution and where you don't care about area or power consumption. That's not a huge market.

We do use them at my employer, a multinational chip company - but only in very small numbers, like one $50k board gets shared around project groups who use it for a few weeks each. Most of the work is done in simulation.