by portly
29 days ago
Don't let the best be the enemy of the good. I got amazing performance by swapping for-loops for some simple SIMD patterns. Moreover, by doing this, I noticed that the codebase started to become better shaped for performance as well. By writing SIMD patterns, you get into the mindset of tight, hot loops.
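To make "simple SIMD patterns" concrete, here is a rough sketch of the kind of rewrite I mean, in plain C++ with no intrinsics. The names `sum_lanes` and `kLanes` are made up for illustration, and the lane count of 8 is an assumption (eight floats fill one 256-bit register), not a recommendation. Whether the compiler actually turns the inner loop into vector instructions depends on the target and the flags.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical lane count, chosen for illustration only.
constexpr std::size_t kLanes = 8;

// Sums n floats using kLanes independent accumulators instead of one.
// The independent accumulators remove the loop-carried dependency that
// blocks vectorization of a naive `total += data[i]` loop.
float sum_lanes(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);

    std::array<float, kLanes> acc{};  // zero-initialized accumulators
    std::size_t i = 0;

    // Main loop: process full blocks of kLanes elements.
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            acc[lane] += data[i + lane];
        }
    }

    // Horizontal step: combine the per-lane partial sums.
    float total = 0.0f;
    for (float a : acc) {
        total += a;
    }

    // Scalar tail: the leftover elements that did not fill a block.
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Note that this changes the order of the floating-point additions, so the result can differ from the naive loop in the last bits.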
|
If you wanted to explicitly opt into bundling/batching of operations, you wouldn't actually want to define a fixed register size. You'd want a data type that represents an arbitrarily sized register and exposes some cross-batch operations. Then the compiler can use this mini DSL to lower your SIMD code to actual instructions.
The problem is solvable, but it requires cooperation from all parties. CPU vendors must offer a basic set of vector instructions that is supported on all architectures. The language committee must be willing to support function-local, variable-size data types that are never exposed in the ABI. The compiler developers must improve the quality of their auto-vectorizers.
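As a rough illustration of what such a type could look like from the user's side, here is a sketch in plain C++. Everything in it (`Batch`, `mul_add`, `add`, `reduce_add`) is a hypothetical interface invented for this comment, not an existing library, and the bodies are ordinary scalar loops standing in for whatever the compiler would generate. The point is only that no register width appears anywhere in the interface.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical width-agnostic batch: a view over `size` elements.
// The interface never names a register size; choosing one is left
// to the compiler (here, to its auto-vectorizer).
template <typename T>
class Batch {
public:
    Batch(T* data, std::size_t size) : data_(data), size_(size) {}

    // Element-wise: x[i] = x[i] * a + b.
    Batch& mul_add(T a, T b) {
        for (std::size_t i = 0; i < size_; ++i) {
            data_[i] = data_[i] * a + b;
        }
        return *this;
    }

    // Element-wise: x[i] += other[i]. Both batches must have the same size.
    Batch& add(const Batch& other) {
        assert(other.size_ == size_);
        for (std::size_t i = 0; i < size_; ++i) {
            data_[i] += other.data_[i];
        }
        return *this;
    }

    // Cross-batch (horizontal) operation: sum of all elements.
    T reduce_add() const {
        T total{};
        for (std::size_t i = 0; i < size_; ++i) {
            total += data_[i];
        }
        return total;
    }

private:
    T* data_;
    std::size_t size_;
};
```

Usage would be something like `Batch<float>(v.data(), v.size()).mul_add(2.0f, 1.0f).reduce_add()`. A real design would also need masking and a way to keep intermediate batches in registers rather than memory, which this sketch does not attempt.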