by ajayjain 2165 days ago
Two years ago, I wrote an LLVM compiler pass that automatically upgrades SSE or AVX-2 code to AVX-512 when those instructions are available. It's helpful for getting performance gains from hand-engineered code that you don't want to touch. We saw some good gains on integer workloads (unpacking, useful for databases - something I'd call "regular code"): 1.16x speedup going from SSE to AVX-2, and 1.43x speedup going from SSE to AVX-512. Even better speedups are available for FP workloads, though hand-writing the AVX-512 version can work a lot better since programmers can exploit the full range of new instructions available in AVX-512.

https://www.nextgenvec.org/ https://arxiv.org/abs/1902.02816

2 comments

The catch is "hand engineered code that you don't want to touch."

I helped put an early deep network application into production that used one particular generation of vector instructions, and it was clear everyone was traumatized by the process of implementing it with those vector instructions and would never do it again.

We later bought servers that had instructions that we weren't using, but we were so focused on getting the product out the door (which we did) that we left those performance gains on the table.

The consequence of the "new instructions every two years" route is that end users don't really experience the performance benefits engineered into the chips. The vast majority of software development organizations are more concerned about support costs than they are about getting the last bit of performance out. (Also, end users are already tired of 200-megabyte binaries, long start times, and other performance decrements that come from high-complexity libraries that contain a huge number of code paths for different CPUs.)

Many working assembly programmers who are good at SIMD love the Intel approach, maybe because it keeps them in business. People are doing very cool things in the parsing space with SIMD instructions (e.g. it is shocking how slow it is to parse ASCII-formatted numbers, JSON, XML, RDF, ...).

However, the old tradition in vector computing is the variable-sized vector unit like what the ILLIAC had or the old Crays, the vector accelerator for the IBM 3090, etc. ARM has the scalable vector instructions which are organized that way.

ARM's Helium had a nice middle ground of having "small, medium and large" implementations available which all run the same code. You might get higher peak performance on Intel's road, but higher realized performance with ARM.
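To make the "same code on small, medium and large implementations" point concrete, here is a rough Python model of a vector-length-agnostic loop. This is only a sketch of the idea, not real Cray or ARM code: the function name `vla_add` and the `vl` parameter (standing in for whatever lane count the hardware reports) are made up for illustration.

```python
def vla_add(a, b, vl):
    """Add two equal-length lists with a strip-mined loop.

    `vl` is a hypothetical stand-in for the hardware vector length in lanes.
    The loop never hard-codes a width, so the same code works for any vl.
    Returns (result, number_of_vector_steps).
    """
    if vl <= 0:
        raise ValueError("vl must be positive")
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")

    out = []
    steps = 0
    for i in range(0, len(a), vl):
        # The last chunk may be shorter than vl. Slicing models the
        # predicate mask that switches off the unused lanes.
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        steps += 1
    return out, steps


a = list(range(10))
b = [1] * 10
narrow_result, narrow_steps = vla_add(a, b, 4)   # "small" implementation
wide_result, wide_steps = vla_add(a, b, 16)      # "large" implementation
```

Both calls produce the same result; only the number of vector steps changes. With a fixed-width instruction set, the wide machine would need a separately written (or recompiled) loop to get that benefit.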

Yep! You got the motivation exactly right. The goal is automatic program "rejuvenation" w/o all the engineering and maintenance costs. Revectorization in the compiler allows programmers to capture much of the performance gain possible with wider instruction sets without having to write and maintain multiple code paths. It can be a big pain to understand and rewrite this vector code -- especially if it's in some external library or another team's code. But agreed that some really nice results are possible with the right engineering such as in parsing.

Do the binaries really get into the 200 MB range from different code paths, or is that due to data?

Cool technique. I read the evaluation and there is afaik no mention of whether the tests were run multi-threaded or single-threaded.

Considering the rather complex dynamic of Intel's downclocking with AVX512 workloads, that would be pretty interesting.

Similarly, the AWS VM CPUs were clocked at 2.0 GHz, which is pretty low and possibly does not even downclock in that configuration. Were tests run on desktop CPUs that regularly go to 5 GHz on non-AVX512 workloads? How is the advantage there?

Thanks! The tests were run on a single thread since the source kernels & test programs were single-threaded. It was important for us to find the largest possible instruction graphs to revectorize at a time so that the whole kernel or most of the kernel runs with the target instruction set. If our search terminated early due to e.g. a pair of low-width instructions that the pass didn't know how to merge, there's a performance loss, so we spent quite a bit of time synthesizing comprehensive lookup tables and implementing shuffle merging and reductions. My impression is that kernels running exclusively AVX512, rather than interleaved scalar/low-width code, generally perform better than the low-width code even with downclocking, but please let me know if that's not the case.
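For anyone who wants a picture of the pair-merging step, here is a toy model in Python. It is not the actual LLVM pass: the op names (`add128`, `add256`, `shuf128`, ...) and the `WIDEN` table are made up for illustration, and it only looks at adjacent instructions rather than searching a whole instruction graph.

```python
# Toy model of pairwise revectorization. The op names and the WIDEN table
# are hypothetical; the real pass operates on LLVM IR instruction graphs.

# Lookup table: narrow op -> wide op doing the same work on twice the lanes.
WIDEN = {
    "add128": "add256",
    "mul128": "mul256",
}


def revectorize(ops):
    """Merge adjacent pairs of independent narrow ops into one wide op.

    Each instruction is a (dest, opcode, sources) tuple. Two neighbours are
    merged only if they share an opcode listed in WIDEN and the second does
    not read the first one's result. Everything else is emitted unchanged.
    """
    out = []
    i = 0
    while i < len(ops):
        dest, op, srcs = ops[i]
        if i + 1 < len(ops):
            dest2, op2, srcs2 = ops[i + 1]
            if op == op2 and op in WIDEN and dest not in srcs2:
                # Pair up operands: (a0, a1) stands for one wide register
                # holding a0 in the low half and a1 in the high half.
                out.append(((dest, dest2), WIDEN[op], tuple(zip(srcs, srcs2))))
                i += 2
                continue
        out.append((dest, op, srcs))
        i += 1
    return out


kernel = [
    ("x0", "add128", ("a0", "b0")),
    ("x1", "add128", ("a1", "b1")),
    ("y0", "shuf128", ("x0", "x1")),  # not in WIDEN, so it stays narrow
    ("y1", "shuf128", ("x0", "x1")),
]
widened = revectorize(kernel)
```

In this example the two adds collapse into one `add256`, but the shuffles stay at 128 bits because the table has no entry for them. That leftover narrow code is the kind of early termination that costs performance, which is why table coverage mattered so much.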

We originally ran experiments on a multisocket lab server which should have a faster clock, but I don't have a record of the specs. That server did fine for SSE to AVX2 conversion, but didn't support AVX512 so we ultimately used Google Cloud VMs.