by kimixa
667 days ago
> and it's also in a class of its own in workloads heavy on AVX-512, though they might be a bit niche.

It'll be interesting to see if it remains niche. I do a fair bit of work on graphics rendering (some games, some not), and there's quite a bit in AVX-512 that interests me, even ignoring the wider register width. A lot of pretty common algorithms we use can be expressed more simply using some of those features.

Previous implementations either weren't available on consumer platforms, or would downclock or limit ALU calculation width for some time after an AVX-512 instruction was run, only returning to full speed after a significant delay (presumably once whatever power delivery issues could settle). That seriously limited the use cases in which it made sense. It wasn't worth having "small data set" users of AVX-512, as they would actually run slower than the equivalent AVX2 code because of this. And the size of a "large enough" data set was pretty close to the point where it's better to schedule the task on the GPU anyway...

But AMD's implementation doesn't seem to have this problem, so this opens up the instruction set to many more use cases than previous implementations.

Or has the AVX-512 ship already sailed, with Intel apparently unable to fix these issues and starting to hack it into even smaller bits? I mean, arguably they should have started with that: the register width is probably the least interesting part to me. But at some point having it actually widely adopted might be more useful than a "possibly better" version that no chip actually supports.
Just as a small example from current code, the much more powerful AVX-512 byte-granular, two-source-register shuffles (vpermt2b) are very tempting for hashing/lookup table code, turning a current perf bottleneck into something that doesn't even show up in the profiler. And according to (http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo...) Zen5 has not one but _TWO_ of them, at a throughput quadrupling Intel's best effort.
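To make the lookup table case concrete, here's a rough sketch of the idea (not the actual code I was referring to). It assumes a 128-entry byte table that fits in two 512-bit registers, and the function name `lut128_lookup` is made up for illustration. With AVX512-VBMI enabled at compile time (e.g. `-mavx512vbmi` on GCC/Clang), the `_mm512_permutex2var_epi8` intrinsic, which compilers can lower to vpermt2b/vpermi2b, does 64 lookups per instruction; otherwise only the scalar loop is built, which has the same semantics.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX512VBMI__)
#include <immintrin.h>
#endif

/* Hypothetical helper: out[i] = table[idx[i] & 0x7F] for i in [0, n).
 * The low 7 bits of each index select one of 128 table bytes; the top
 * bit is ignored, matching what the two-source byte permute does. */
void lut128_lookup(const uint8_t table[128], const uint8_t *idx,
                   uint8_t *out, size_t n)
{
    size_t i = 0;
#if defined(__AVX512VBMI__)
    /* The whole table lives in two registers: entries 0..63 and 64..127. */
    const __m512i lo = _mm512_loadu_si512(table);
    const __m512i hi = _mm512_loadu_si512(table + 64);
    for (; i + 64 <= n; i += 64) {
        __m512i v = _mm512_loadu_si512(idx + i);
        /* Index bit 6 picks lo or hi, bits 5:0 pick the byte within it. */
        __m512i r = _mm512_permutex2var_epi8(lo, v, hi);
        _mm512_storeu_si512(out + i, r);
    }
#endif
    /* Scalar tail, and the full fallback when VBMI is not enabled. */
    for (; i < n; i++)
        out[i] = table[idx[i] & 0x7F];
}
```

With AVX2 the closest equivalent is vpshufb, which only indexes 16 bytes per 128-bit lane, so a 128-entry table needs several shuffles plus blends per vector of indices. That's the kind of code this replaces.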