by ysleepy 2165 days ago
Cool technique. I read the evaluation and, afaik, there is no mention of whether the tests were run multi-threaded or single-threaded.

Considering the rather complex dynamic of Intel's downclocking with AVX512 workloads, that would be pretty interesting.

Similarly, the AWS VM CPUs were clocked at 2.0 GHz, which is pretty low and possibly does not even downclock in that configuration. Were tests run on desktop CPUs that regularly reach 5 GHz on non-AVX512 workloads? How does the advantage hold up there?

1 comment

Thanks! The tests were run on a single thread, since the source kernels and test programs were single-threaded. It was important for us to find the largest possible instruction graphs to revectorize at a time, so that the whole kernel, or most of it, runs with the target instruction set. If our search terminated early, e.g. because of a pair of low-width instructions that the pass didn't know how to merge, there was a performance loss, so we spent quite a bit of time synthesizing comprehensive lookup tables and implementing shuffle merging and reductions. My impression is that kernels running exclusively AVX512, rather than interleaved scalar/low-width code, generally outperform the low-width code even with downclocking, but please let me know if that's not the case.

We originally ran experiments on a multi-socket lab server, which should have a faster clock, but I don't have a record of the specs. That server did fine for SSE to AVX2 conversion, but it didn't support AVX512, so we ultimately used Google Cloud VMs.
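
To make the lookup-table point above a bit more concrete, here is a toy sketch of why an unknown pair hurts. It is only an illustration under loud assumptions: the `WIDEN` table, the `128`/`256` opcode suffixes, and the `revectorize` function are all hypothetical and are not the actual pass or tables. It greedily merges adjacent identical narrow ops when the table knows a wide equivalent, and leaves anything else narrow.

```python
# Toy illustration only. The opcode names, the WIDEN table, and this greedy
# pairing strategy are hypothetical; they are not the real pass or its tables.

# Hypothetical lookup table: narrow (128-bit) opcode -> wide (256-bit) opcode.
WIDEN = {
    "paddd128": "paddd256",
    "pmulld128": "pmulld256",
    "pxor128": "pxor256",
}


def revectorize(instrs):
    """Merge adjacent pairs of identical narrow ops into one wide op.

    `instrs` is a list of opcode strings. Returns (output, merged_count).
    Pairs the table does not cover stay narrow, so the result mixes widths.
    """
    out = []
    merged = 0
    i = 0
    while i < len(instrs):
        op = instrs[i]
        is_pair = i + 1 < len(instrs) and instrs[i + 1] == op
        if is_pair and op in WIDEN:
            out.append(WIDEN[op])  # one wide op replaces two narrow ones
            merged += 1
            i += 2
        else:
            out.append(op)  # no known merge: the instruction stays narrow
            i += 1
    return out, merged


if __name__ == "__main__":
    kernel = ["paddd128", "paddd128", "pshufb128", "pshufb128"]
    print(revectorize(kernel))
```

In this toy, the `pshufb128` pair is not in the table, so it stays narrow and the output interleaves wide and narrow code. That is the situation the larger tables and shuffle merging are meant to avoid.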