|
|
|
|
|
by Nyan
1420 days ago
|
|
Nicely done. > There is still a lot of room to micro-optimize both the avx and avx64 implementation I personally couldn't see much - perhaps aligning loads and defering `_mm256_madd_epi16` are the only ideas that come to mind.
What did you have in mind? |
|
* An avx96/avx128 version, which requires more care than avx32/avx64 because you will overflow a 16 bit signed number if you simply extend the coefficient vectors from 0..32 to 0..96/128 (e.g. 255*96 + 254*96 > 32767), but looking at it now, I realize you shouldn't actually need more than one 0..32 coefficient vector.
* The chunk length could be longer because there are 8 separate 32 bit counters in each vector, which can be summed into a uint64_t instead of a uint32_t when computing the modulo.
* As you said, aligning the loads and deferring the `_mm256_madd_epi16` outside of the loop. For deferring the madd specifically, using two separate sum2 vectors and splitting the `mad` vector into two by using `_mm256_and_si256(mad, _mm256_set1_epi32(0xFFFF)` and `_mm256_srli(mad, 16)` which should improve upon the 5 cycle latency hit incurred by the madd.
Plus I am sure there are many other opportunities to optimize this I have not thought of :)