by ashdnazg
385 days ago
With AVX512 (and to a lesser extent with AVX2) one can implement 256-bit addition pretty efficiently, with the additional benefit of fitting more numbers in registers. It looks more or less like this:

  __m256i s = _mm256_add_epi64(a, b);                    // lane-wise sums, no carries yet
  const __m256i all_ones = _mm256_set1_epi64x(~0);
  int g = _mm256_cmpgt_epu64_mask(a, s);                 // lanes that overflowed (generate a carry)
  int p = _mm256_cmpeq_epu64_mask(s, all_ones);          // lanes that would pass a carry on (propagate)
  int carries = ((g << 1) + p) ^ p;                      // carry into each lane
  __m256i ret = _mm256_mask_sub_epi64(s, carries, s, all_ones);  // s - (-1) = s + 1 where a carry comes in

The throughput even seems to be better: https://godbolt.org/z/e7zETe8xY

It's trivial to change this to do 512-bit addition, where the improvement will be even more significant.
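In case the carry expression looks like magic, here is a rough scalar model of the same idea that runs without AVX512. It is only a sketch: the names add256_model and check_examples are made up for illustration, the limbs are assumed to be stored least significant first, and it says nothing about performance. The point is that adding p to (g << 1) lets the integer adder ripple a carry through any run of all-ones lanes, and the final XOR with p leaves exactly the lanes that receive a carry.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the mask-based carry trick for 4 x 64-bit limbs.
   Assumption: limb 0 is the least significant limb.
   Returns the carry out of the full 256-bit addition. */
unsigned add256_model(const uint64_t a[4], const uint64_t b[4], uint64_t out[4])
{
    uint64_t s[4];
    unsigned g = 0; /* bit i set: limb i overflowed (generates a carry)   */
    unsigned p = 0; /* bit i set: limb i is all ones (propagates a carry) */

    for (int i = 0; i < 4; i++) {
        s[i] = a[i] + b[i];                    /* wraps modulo 2^64 */
        if (s[i] < a[i])        g |= 1u << i;
        if (s[i] == UINT64_MAX) p |= 1u << i;
    }

    /* Bit i of carries is the carry into limb i; bit 4 is the carry out. */
    unsigned carries = ((g << 1) + p) ^ p;

    for (int i = 0; i < 4; i++)
        out[i] = s[i] + ((carries >> i) & 1u); /* +1 where a carry comes in */

    return (carries >> 4) & 1u;
}

/* Small usage example: (2^128 - 1) + 1 == 2^128. */
void check_examples(void)
{
    const uint64_t a[4] = {UINT64_MAX, UINT64_MAX, 0, 0};
    const uint64_t b[4] = {1, 0, 0, 0};
    uint64_t out[4];

    unsigned carry_out = add256_model(a, b, out);

    assert(carry_out == 0);
    assert(out[0] == 0 && out[1] == 0 && out[2] == 1 && out[3] == 0);
}
```

In the example, limb 0 generates a carry and limb 1 is all ones, so g = 0b0001, p = 0b0010 and carries = 0b0110: limb 1 wraps to zero and limb 2 picks up the carry.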
https://stackoverflow.com/questions/56852812/simd-instructio...