by RaisingSpear
861 days ago
> So far I couldn't exceed 1 GB/s/core [4], so more research is needed. If you have any ideas - I am all ears!

I don't know much about the space you're working in, but some things I'd point out:

* 32b and 64b SIMD multiplication have very high latency on Intel CPUs (actually, expect multiplication to have high latency in general on any CPU). If you're not using multiple independent chains of zmm registers, performance will suck

* _mm512_set_epi64 with memory sources likely performs poorly. See if you can do full vector loads and swizzle the data into position (i.e. unpack/shuffle operations)

* do you really need to split the string into 4 parts? Can you just use the same "part", but offset each lane by a byte?

* what about CRC64? Like 32/64b multiplication, CLMUL/PMULL instructions tend to have high latency, so might not help on that front, but there may be other benefits
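To illustrate the "multiple independent chains" point, here's a rough scalar sketch in plain C. I'm assuming a simple multiplicative polynomial hash (h = h * P + byte, mod 2^64), which may not match what you're actually computing; `HASH_P`, `hash_serial` and `hash_4chains` are made-up names for this example. The serial version has one long dependency chain, so every multiply waits on the previous one. The second version keeps four accumulators that don't depend on each other, so the multiplies can overlap, and it still produces the same value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical multiplier: an arbitrary odd constant picked for this sketch. */
#define HASH_P 0x9E3779B97F4A7C15ULL

/* Reference: a single chain, each multiply depends on the previous result. */
uint64_t hash_serial(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint64_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = h * HASH_P + data[i];
    return h;
}

/* Same hash value, computed with four independent accumulator chains. */
uint64_t hash_4chains(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    const uint64_t p2 = HASH_P * HASH_P; /* P^2 mod 2^64 */
    const uint64_t p3 = p2 * HASH_P;     /* P^3 mod 2^64 */
    const uint64_t p4 = p2 * p2;         /* P^4 mod 2^64 */

    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;

    /* Chain j handles bytes j, j+4, j+8, ... and steps by P^4 each round. */
    for (; i + 4 <= len; i += 4) {
        a0 = a0 * p4 + data[i];
        a1 = a1 * p4 + data[i + 1];
        a2 = a2 * p4 + data[i + 2];
        a3 = a3 * p4 + data[i + 3];
    }

    /* Fold the four chains back into one value. */
    uint64_t h = a0 * p3 + a1 * p2 + a2 * HASH_P + a3;

    /* Leftover 0..3 bytes go through the ordinary serial step. */
    for (; i < len; i++)
        h = h * HASH_P + data[i];
    return h;
}
```

In a SIMD version each accumulator would be its own zmm register rather than a scalar, but the dependency structure is the same idea. Whether four chains is enough to hide the multiply latency depends on the CPU, so treat the count as something to tune.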