by ajayjain
2165 days ago
Two years ago, I wrote an LLVM compiler pass that automatically upgrades SSE or AVX2 code to AVX-512 when those instructions are available. It's helpful for getting performance gains from hand-engineered code that you don't want to touch. We saw some good gains on integer workloads (unpacking, useful for databases; something I'd call "regular code"): a 1.16x speedup going from SSE to AVX2, and a 1.43x speedup going from SSE to AVX-512. Even better speedups are available for FP workloads, though hand-writing the AVX-512 version can work a lot better, since programmers can exploit the full range of new instructions available in AVX. https://www.nextgenvec.org/
https://arxiv.org/abs/1902.02816
|
I helped put an early deep network application into production that used one particular generation of vector instructions, and it was clear everyone was traumatized by the process of implementing it with those vector instructions and would never do it again.
We later bought servers that had instructions that we weren't using, but we were so focused on getting the product out the door (which we did) that we left those performance gains on the table.
The consequence of the "new instructions every two years" route is that end users don't really experience the performance benefits engineered into the chips. The vast majority of software development organizations are more concerned about support costs than they are about getting the last bit of performance out. (Also, end users are already tired of 200-megabyte binaries, long start times, and other performance decrements that come from high-complexity libraries that contain a huge number of code paths for different CPUs.)
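To make the "code paths for different CPUs" point concrete, here is a minimal sketch of runtime dispatch in C. It assumes GCC or Clang on x86 for the `__builtin_cpu_supports` check and the `target("avx2")` attribute; on anything else it falls back to the scalar path. The function names (`sum_u8`, `select_sum_u8`, etc.) are made up for illustration and not taken from any real library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline path: plain C, runs on any CPU. */
static uint32_t sum_u8_scalar(const uint8_t *p, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) s += p[i];
    return s;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Same loop, but the compiler is allowed to use AVX2 for this one function. */
__attribute__((target("avx2")))
static uint32_t sum_u8_avx2(const uint8_t *p, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) s += p[i];
    return s;
}
#endif

typedef uint32_t (*sum_u8_fn)(const uint8_t *, size_t);

/* Pick an implementation based on what the running CPU reports. */
static sum_u8_fn select_sum_u8(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_u8_avx2;
#endif
    return sum_u8_scalar;
}

/* Public entry point: resolves the implementation on first call, then reuses it. */
static uint32_t sum_u8(const uint8_t *p, size_t n) {
    static sum_u8_fn impl = NULL;
    if (impl == NULL) impl = select_sum_u8();
    assert(impl != NULL);
    return impl(p, n);
}
```

Every extra instruction set means another copy of each hot function plus another branch in the selector, which is where the binary size and the support burden come from.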
Many working assembly programmers who are good at SIMD love the Intel approach, maybe because it keeps them in business. People are doing very cool things in the parsing space (e.g. it is shocking how slow it is to parse ASCII-formatted numbers, JSON, XML, RDF, ...) with SIMD instructions.
However, the old tradition in vector computing is the variable-sized vector unit, as in the ILLIAC, the old Crays, the vector accelerator for the IBM 3090, etc. ARM's Scalable Vector Extension (SVE) is organized that way.
ARM's Helium had a nice middle ground of having "small, medium and large" implementations available which all run the same code. You might get higher peak performance on Intel's road, but higher realized performance with ARM.
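As a rough illustration of what "all run the same code" means, here is a plain-C sketch of a vector-length-agnostic (strip-mined) loop. It is only a simulation: `vl` stands in for whatever vector length the hardware would report, and `vadd_chunk` is a hypothetical stand-in for a single vector add. Real SVE or Helium code would use predicated instructions or intrinsics instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one hardware vector add over `lanes` elements. */
static void vadd_chunk(const float *a, const float *b, float *out, size_t lanes) {
    for (size_t i = 0; i < lanes; i++) out[i] = a[i] + b[i];
}

/* Strip-mined loop: the vector length is a runtime value, never hard-coded. */
static void add_arrays(const float *a, const float *b, float *out,
                       size_t n, size_t vl) {
    assert(vl > 0);
    for (size_t i = 0; i < n; i += vl) {
        /* The last chunk is simply shorter; no separate scalar tail loop. */
        size_t lanes = (n - i < vl) ? (n - i) : vl;
        vadd_chunk(a + i, b + i, out + i, lanes);
    }
}
```

Calling `add_arrays` with `vl` set to 4, 8 or 16 gives the same results; only the number of trips through the loop changes, which is the property that lets one binary serve small, medium and large implementations.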