Not really, unfortunately, and it’s a pre-existing framework for teaching a class, so simplicity of compilation is extra important. Also if I try to isolate the SIMD bits in C++ I’ll lose the opportunity to have them be inlined which will defeat the optimization purpose.
> Also if I try to isolate the SIMD bits in C++ I’ll lose the opportunity to have them be inlined which will defeat the optimization purpose.
Agreed. Usually the interface exposed by the C++ side would be coarse-grained, something like RunEntireAlgorithm(), not DotProduct(), so the small SIMD helpers still get inlined inside it.
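A minimal sketch of that split, in case it helps. All names here (simd_module, NormalizeRows, the internal DotProduct helper) are hypothetical and not from any real framework, and the plain scalar loops only stand in for where the SIMD code would go:

```cpp
#include <cmath>
#include <cstddef>

namespace simd_module {
namespace {

// Fine-grained helper (hypothetical). It has internal linkage and lives in the
// same translation unit as its caller, so the compiler is free to inline it.
inline float DotProduct(const float* a, const float* b, std::size_t n) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
  return sum;
}

}  // namespace

// Coarse-grained entry point (hypothetical): the only function the rest of
// the framework would call. Scales each row of a row-major matrix to unit
// length, in place.
void NormalizeRows(float* matrix, std::size_t rows, std::size_t cols) {
  for (std::size_t r = 0; r < rows; ++r) {
    float* row = matrix + r * cols;
    const float norm = std::sqrt(DotProduct(row, row, cols));
    if (norm == 0.0f) continue;  // leave all-zero rows untouched
    for (std::size_t c = 0; c < cols; ++c) row[c] /= norm;
  }
}

}  // namespace simd_module
```

The framework pays one non-inlined call per NormalizeRows(), rather than one per dot product.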
> For those that are new to this, can you give an example of a kind of computation or algorithm which is well-served by your project but not possible with vector extensions
Sure. Vector extensions are OKish for simple math but JPEG XL includes nontrivial cross-lane operations such as transpose and boundary handling for convolution.
__builtin_shufflevector requires a vector length known at compile time, and the compiler can pessimize it (fusing two shuffles into one general all-to-all permute, which is more expensive than two simple shuffles).
Also, vqsort (https://github.com/google/highway/tree/master/hwy/contrib/so...) consists almost entirely of operations not supported by the extensions, and actually works out of the box on variable-length RISC-V and SVE, which code written with compiler extensions cannot.