
by kimixa 714 days ago

> and it's also in a class of its own in workloads heavy on AVX-512, though they might be a bit niche.

It'll be interesting to see if it remains niche - I do a fair bit of work on graphics rendering (some games, some not) and there's quite a bit in avx512 that interests me, even ignoring the wider register width. A lot of pretty common algorithms we use can be expressed more simply using some of those features.

Previous implementations either weren't available on consumer platforms, or would downclock/limit ALU calculation width for some time after an avx512 instruction was run, only returning to full speed after a significant delay - presumably once whatever power delivery issues had settled. That seriously limited the use cases in which it made sense. It wasn't worth having "small data set" users of avx512, as they would actually run slower than the equivalent avx2 code. And the size of a "large enough" data set was pretty close to where it's better to schedule the task on the GPU anyway...

But AMD's implementation doesn't seem to have this problem - so this opens up the instruction set to many more use cases than previous implementations.

Or has the AVX512 ship already sailed, with Intel apparently unable to fix these issues and now hacking it into even smaller bits? I mean, arguably they should have started with that - the register width is probably the least interesting part to me - but at some point having it actually widely adopted might be more useful than a "possibly better" version that no chip actually supports.


Remnant44 714 days ago

I agree. I work in a similar field, and the value of AVX512 is clearly there - it just hasn't been worth implementing for the tiny percentage of market penetration. This is directly due to the market segmentation strategy Intel applied. AMD has raised the ante for AVX512 with two excellent implementations in a row, and for the first time ever I'm definitely considering building AVX512 targets.

Just as a small example from current code, the much more powerful AVX512 byte-granular, two-source-register shuffles (vpermt2b) are very tempting for hashing/lookup table code, turning a current perf bottleneck into something that doesn't even show up in the profiler. And according to (http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo...) Zen5 has not one but _TWO_ of them, at a throughput quadrupling Intel's best effort.
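For anyone who hasn't used the two-source byte shuffles, here is a rough scalar model in Python of what the 512-bit form computes. It is only a sketch of the semantics, not real SIMD code and not anything from my codebase: the names `permutex2var_epi8_model` and `lut128_lookup` are made up for illustration. In C the corresponding intrinsic is `_mm512_permutex2var_epi8`, which needs AVX512_VBMI. The point is that a 128-entry byte table fits in two 64-byte registers, so one shuffle does 64 table lookups at once.

```python
def permutex2var_epi8_model(a: bytes, idx: bytes, b: bytes) -> bytes:
    """Scalar model of a 512-bit two-source byte shuffle (VPERMT2B/VPERMI2B).

    a, b : the two 64-byte source "registers" (together a 128-entry table)
    idx  : 64 byte-sized indices, one per output lane
    """
    assert len(a) == len(b) == len(idx) == 64
    out = bytearray(64)
    for lane in range(64):
        sel = idx[lane] & 0x7F          # only the low 7 bits are used
        src = b if sel & 0x40 else a    # bit 6 picks which source register
        out[lane] = src[sel & 0x3F]     # bits 5:0 pick the byte within it
    return bytes(out)


def lut128_lookup(table: bytes, indices: bytes) -> bytes:
    """Look up 64 indices in a 128-entry byte table with one modelled shuffle."""
    assert len(table) == 128 and len(indices) == 64
    lo, hi = table[:64], table[64:]     # table split across two "registers"
    return permutex2var_epi8_model(lo, indices, hi)


# Illustrative data only: an arbitrary 128-entry table and 64 even indices.
table = bytes((i * 7 + 3) & 0xFF for i in range(128))
indices = bytes(range(0, 128, 2))
result = lut128_lookup(table, indices)
```

Every output lane equals `table[indices[lane] & 0x7F]`, which is why this maps so nicely onto small hash/lookup tables; with the single-source shuffles you only get a 64-entry table per instruction.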


kvemkon 714 days ago

> But AMD's implementation doesn't seem to have this problem

From an article:

> Does Zen5 throttle under AVX512?

> Yes it does. Intel couldn't get away from this, and neither can AMD. Laws of physics are the laws of physics.

> The difference is how AMD does the throttling ...

Further details in the article [1].

Discussed here on HN: [2], [3].

[1] https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teard...

[2] https://news.ycombinator.com/item?id=41182395

[3] https://news.ycombinator.com/item?id=41248260


kimixa 712 days ago

Indeed, the difference does appear to be in how AMD does the throttling.

From the linked numberworld blog:

> Thus on Zen4 and Zen5, there is no drawback to "sprinkling" small amounts of AVX512 into otherwise scalar code. They will not throttle the way that Intel does.

This is exactly the use case I'm talking about - relatively small chunks of avx512-using code spread throughout the codebase. Larger chunks of work tend to be worth passing over to the GPU already.
