
Thanks! The tests were run on a single thread since the source kernels and test programs were single-threaded. It was important for us to find the largest possible instruction graphs to revectorize at a time, so that the whole kernel or most of it runs with the target instruction set. If our search terminated early due to e.g. a pair of low-width instructions that the pass didn't know how to merge, there's a performance loss, so we spent quite a bit of time synthesizing comprehensive lookup tables and implementing shuffle merging and reductions. My impression is that kernels running exclusively AVX512 code, rather than interleaved scalar/low-width code, generally perform better than the low-width code even with downclocking, but please let me know if that's not the case.
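
To make the "search terminated early" point concrete, here is a toy sketch of growing a revectorization graph from a seed pair and stopping at a pair that has no entry in the lookup table. It is only an illustration, not the actual pass: the opcode names, the `WIDEN` table, the `Inst` type and `grow_packs` are all made up for the example.

```python
from collections import namedtuple

# Hypothetical instruction record: an opcode plus the names of its operands.
Inst = namedtuple("Inst", ["opcode", "operands"])

# Hypothetical lookup table: narrow opcode -> wide opcode that covers two of them.
# "sse.rcp" is deliberately missing to model an instruction the pass cannot merge.
WIDEN = {
    "sse.load": "avx2.load",
    "sse.mul": "avx2.mul",
    "sse.store": "avx2.store",
}

def grow_packs(graph, seed):
    """Walk operand pairs from `seed`, merging every pair the table covers.

    Returns (merged, blocked): merged pairs with their wide opcode, and the
    pairs where the walk had to stop because no wide equivalent is known.
    """
    merged, blocked = [], []
    worklist, seen = [seed], set()
    while worklist:
        pair = worklist.pop()
        if pair in seen:
            continue
        seen.add(pair)
        a, b = graph[pair[0]], graph[pair[1]]
        mergeable = (
            a.opcode == b.opcode
            and a.opcode in WIDEN
            and len(a.operands) == len(b.operands)
        )
        if not mergeable:
            blocked.append(pair)  # everything feeding this pair stays narrow
            continue
        merged.append((pair, WIDEN[a.opcode]))
        # Continue the search through the corresponding operands of both halves.
        worklist.extend(zip(a.operands, b.operands))
    return merged, blocked

# Two parallel narrow chains: store(mul(load, rcp)).
GRAPH = {
    "s0": Inst("sse.store", ("m0",)), "s1": Inst("sse.store", ("m1",)),
    "m0": Inst("sse.mul", ("l0", "x0")), "m1": Inst("sse.mul", ("l1", "x1")),
    "l0": Inst("sse.load", ()), "l1": Inst("sse.load", ()),
    "x0": Inst("sse.rcp", ()), "x1": Inst("sse.rcp", ()),
}

merged, blocked = grow_packs(GRAPH, ("s0", "s1"))
```

In this toy graph the store, mul and load pairs get widened, but the `sse.rcp` pair ends up in `blocked`, so that part of the kernel would stay narrow and its results would have to be stitched into the wide code. Fewer gaps in the table means fewer such boundaries.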

We originally ran experiments on a multisocket lab server, which should have a faster clock, but I don't have a record of the specs. That server did fine for SSE-to-AVX2 conversion but didn't support AVX512, so we ultimately used Google Cloud VMs.