| HN Mirror

Not sure if any of these would result in meaningful performance gains, but a few ideas I had:

* An avx96/avx128 version, which requires more care than avx32/avx64 because you will overflow a 16 bit signed number if you simply extend the coefficient vectors from 0..32 to 0..96/128 (e.g. 255*96 + 254*96 > 32767), but looking at it now, I realize you shouldn't actually need more than one 0..32 coefficient vector.

* The chunk length could be longer because there are 8 separate 32 bit counters in each vector, which can be summed into a uint64_t instead of a uint32_t when computing the modulo.

* As you said, aligning the loads and deferring the `_mm256_madd_epi16` outside of the loop. For deferring the madd specifically, using two separate sum2 vectors and splitting the `mad` vector into two by using `_mm256_and_si256(mad, _mm256_set1_epi32(0xFFFF)` and `_mm256_srli(mad, 16)` which should improve upon the 5 cycle latency hit incurred by the madd.

Plus I am sure there are many other opportunities to optimize this I have not thought of :)