by kissiel 2806 days ago
I wonder about the Joules per byte. AFAIK AVX units are quite expensive energy-wise.

Don't they also tend to work at a lower clock due to their higher energy requirements?

edit: though this is AVX2 ("AVX-256") rather than AVX-512. Lemire has covered AVX and the possibility of throttling (with or without AVX) in the past, so they're probably aware of the potential issue and consider that either it won't be triggered or the gain is large enough to compensate for the lower frequency.

Nice. So I understand that AVX2 is not bringing the CPU's clock down.

Got any sources for power consumption figures/comparisons of those AVX units?

Heavy use of complex AVX2 operations causes downclocking, too, but typically less so than AVX-512. More details are documented in https://en.wikichip.org/wiki/intel/frequency_behavior -- also see e.g. https://en.wikichip.org/wiki/intel/xeon_gold/6138#Frequencie... for an example of how the frequencies differ depending on the number of active cores.

I think the reason for reducing clock speed when vector units are in heavy use is to keep power usage in check.

You might also find https://blog.cloudflare.com/on-the-dangers-of-intels-frequen... helpful, which goes into detail about a specific case where dynamic frequency scaling resulted in AVX-512 code running slower than AVX2 code.
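
A rough back-of-the-envelope model shows how that can happen. This is only a sketch under simplifying assumptions, not the analysis from the Cloudflare post: the function name and all numbers are made up, and it assumes the core stays at the reduced clock for the whole run, so the scalar parts of a mixed workload pay the frequency penalty too.

```python
def relative_runtime(vector_fraction, vector_speedup, clock_ratio):
    """Runtime relative to a baseline that runs entirely at full clock.

    vector_fraction: share of baseline runtime spent in the code being vectorized
    vector_speedup:  per-cycle speedup of that code from the wider vectors
    clock_ratio:     reduced clock / full clock (assumed to apply to everything)
    """
    # Cycles needed after vectorizing, relative to the baseline cycle count.
    cycles = (1.0 - vector_fraction) + vector_fraction / vector_speedup
    # Fewer cycles per second means proportionally more wall-clock time.
    return cycles / clock_ratio


# Hypothetical numbers: 2x per-cycle gain, clock drops to 80% of nominal.
mostly_vector = relative_runtime(0.9, 2.0, 0.8)  # 90% of time in vector code
mostly_scalar = relative_runtime(0.1, 2.0, 0.8)  # 10% of time in vector code

print(f"mostly vector: {mostly_vector:.4f}x baseline runtime")
print(f"mostly scalar: {mostly_scalar:.4f}x baseline runtime")
```

With these made-up numbers, the mostly-vector workload finishes in about 69% of the baseline time, while the mostly-scalar one takes about 19% longer even though the vector code itself is twice as fast per cycle.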

It's worth noting that the Cloudflare test was done on a Xeon Silver, which behaves worse around the frequency changes than the Gold or Platinum. If you're on either Gold or Platinum, you're less likely to suffer the problems that Cloudflare did with mixed workloads.

This seems like an optimisation nightmare. Your program needs to know both which instructions the chip supports and which model it is within a family in order to decide whether or not to use certain vector instructions.
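
To make the two-part decision concrete, here is a minimal sketch of the dispatch logic. It assumes Linux-style `/proc/cpuinfo` text, where `avx2` and `avx512f` appear on the `flags` line; the function names and the `allow_avx512` knob are hypothetical, and how you would actually set that knob per chip model is exactly the hard part.

```python
def parse_cpu_flags(cpuinfo_text):
    """Extract the feature flags from /proc/cpuinfo-style text (x86 Linux)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Line looks like "flags : fpu vme ... avx2 ..."
            return set(line.split(":", 1)[1].split())
    return set()


def pick_kernel(cpu_flags, allow_avx512):
    """Pick the widest kernel the chip supports and the policy allows.

    allow_avx512 is a hypothetical policy switch: it stands in for
    "this model does not downclock badly enough to hurt us".
    """
    if "avx512f" in cpu_flags and allow_avx512:
        return "avx512"
    if "avx2" in cpu_flags:
        return "avx2"
    return "scalar"


sample = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 avx512f\n"
flags = parse_cpu_flags(sample)
print(pick_kernel(flags, allow_avx512=False))
print(pick_kernel(flags, allow_avx512=True))
```

On a real machine you would feed it the contents of `/proc/cpuinfo` instead of the sample string.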

The downclocking does not apply at all to simple 256-bit bit-juggling operations. The code in question should run at full speed.
This doesn't do anything harder than a saturating subtract.
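
For anyone unfamiliar with the operation, a saturating subtract clamps at the bottom of the range instead of wrapping around. Below is a scalar sketch of the per-lane behaviour, assuming unsigned 8-bit lanes (the comment doesn't say which variant the code uses). In AVX2 the unsigned byte version is the `_mm256_subs_epu8` intrinsic, which handles 32 bytes per instruction; the Python helper name here is made up.

```python
def saturating_sub_u8(a, b):
    """Per-byte unsigned saturating subtract: max(x - y, 0) for each lane."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(max(x - y, 0) for x, y in zip(a, b))


result = saturating_sub_u8(bytes([200, 10, 0]), bytes([100, 20, 1]))
print(list(result))  # [100, 0, 0]
```

A wrapping subtract would have produced 246 and 255 in the last two lanes instead of 0.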
The energy per byte could well be lower than with a scalar approach. SIMD units like AVX are power hungry, but a greater fraction of that power goes to relevant computation rather than to control, scheduling, etc. Ideally, the constant per-instruction overhead of getting it executing on a functional unit is amortized over the width of the vector.
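
The amortization argument can be put as a toy formula: energy per byte is roughly the fixed per-instruction overhead divided by the vector width, plus the per-byte cost of the actual computation. The sketch below uses arbitrary energy units and invented values, not measurements of any real CPU.

```python
def energy_per_byte(control_energy, lane_energy, width_bytes):
    """Toy model: fixed per-instruction overhead spread over the vector width.

    control_energy: fetch/decode/schedule cost per instruction (arbitrary units)
    lane_energy:    compute cost per byte processed (arbitrary units)
    width_bytes:    bytes handled by one instruction
    """
    return control_energy / width_bytes + lane_energy


# Invented values: the vector unit is assumed to cost twice as much per byte
# of computation, but pays the control overhead once per 32 bytes.
scalar = energy_per_byte(control_energy=8.0, lane_energy=1.0, width_bytes=1)
avx2 = energy_per_byte(control_energy=8.0, lane_energy=2.0, width_bytes=32)

print(f"scalar: {scalar} units/byte, 256-bit vector: {avx2} units/byte")
```

Under these assumptions the vector version comes out at 2.25 units per byte against 9.0 for scalar. Whether real hardware behaves this way is what actual power measurements would have to show.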