by ashdnazg 385 days ago

With AVX512 (and, to a lesser extent, AVX2) one can implement 256-bit addition pretty efficiently, with the additional benefit of fitting more numbers in registers.

It looks more or less like this:

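  __m256i s = _mm256_add_epi64(a, b);                  // per-lane 64-bit sums, carries ignored
  const __m256i all_ones = _mm256_set1_epi64x(~0);
  __mmask8 g = _mm256_cmpgt_epu64_mask(a, s);          // generate: lane wrapped around
  __mmask8 p = _mm256_cmpeq_epu64_mask(s, all_ones);   // propagate: lane wraps if it gets a carry-in
  __mmask8 carries = ((g << 1) + p) ^ p;               // lanes that receive a carry-in

  // s - (-1) == s + 1, applied only to the lanes selected by carries
  __m256i ret = _mm256_mask_sub_epi64(s, carries, s, all_ones);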

The throughput even seems to be better: https://godbolt.org/z/e7zETe8xY

It's trivial to change this to do 512-bit addition, where the improvement will be even more significant.
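The carry-mask expression in the snippet above can be checked without AVX512 hardware. What follows is only a portable scalar sketch of the same generate/propagate idea, not the commenter's code: the names carry_mask and add256 are hypothetical, and limb 0 is assumed to be the least significant 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: given per-limb generate (g) and propagate (p) bits,
 * return the mask of limbs that receive a carry-in. The integer addition
 * ripples a generated carry through any run of consecutive propagate bits;
 * XOR with p keeps only the bits that the ripple changed. */
static unsigned carry_mask(unsigned g, unsigned p) {
    return (((g << 1) + p) ^ p) & 0xFu; /* 4 limbs; carry-out of the top limb is dropped */
}

/* Hypothetical scalar model of the 256-bit add (limb 0 = least significant). */
static void add256(const uint64_t a[4], const uint64_t b[4], uint64_t out[4]) {
    uint64_t s[4];
    unsigned g = 0, p = 0;
    for (int i = 0; i < 4; i++) {
        s[i] = a[i] + b[i];                    /* wraps modulo 2^64 */
        if (a[i] > s[i]) g |= 1u << i;         /* limb overflowed on its own */
        if (s[i] == UINT64_MAX) p |= 1u << i;  /* limb overflows if it gets a carry-in */
    }
    assert((g & p) == 0); /* a limb cannot both generate and propagate */
    unsigned c = carry_mask(g, p);
    for (int i = 0; i < 4; i++)
        out[i] = s[i] + ((c >> i) & 1u);       /* add 1 to limbs with a carry-in */
}
```

For example, with g = 0b0001 and p = 0b0110, the carry generated in limb 0 ripples through limbs 1 and 2 and lands in limb 3, so the mask comes out as 0b1110.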

amitprasad 385 days ago

Note that, especially on certain Intel architectures, using AVX512 instructions _at all_ can cause the whole processor to downclock, resulting in inconsistent or slower overall performance.

https://stackoverflow.com/questions/56852812/simd-instructio...

adgjlsfhk1 385 days ago

> using AVX512 instructions _at all_

This isn't correct. AVX512 provides a bunch of extra instructions, zmm (512-bit) registers, and an extra 16 vector registers (for a total of 32). The downclocking only happens if you use 512-bit registers (not just AVX512 instructions). The difference matters a lot here, since AVX512 adds a bunch of really useful instructions (e.g. 64-bit integer multiply) that are pure upside.

Also, none of this is an issue on Zen4 or Zen5, since they use a much more sensible downclocking scheme that only kicks in if you've used enough instructions in a row to start spiking power/temp.

amitprasad 385 days ago

Ah yes, you’re completely correct :)

The general idea was just to highlight some of the dangers of vector registers. I believe the same is true of ymm (256-bit) registers, to a lesser extent.
