by hairtuq
561 days ago
The author wonders:

> In theory at least, the compiler can see that rule only has 256 values and create a reduced version of ca1d_rule_apply for each value. Whether it actually does is not of much practical concern when the rendering code is the bottle neck. However it’s interesting to see if the compiler can deduce the best solution or whether anything trips it up.

The compiler is unlikely to get the optimal result here. The core of this is finding the best instruction sequence for a ternary boolean operation encoded in 8 bits; it's the same job needed for emulating the AVX512F "vpternlog" instruction. This can always be done in at most 5 instructions (or 4 if you have andnot/ornot/xornot), but finding that sequence is not straightforward.

Here is some code that calculates optimal instruction sequences (by letting z3 do the heavy lifting): https://github.com/falk-hueffner/ternary-logic-optimization
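To make the "ternary boolean operation encoded in 8 bits" concrete, here is a minimal Python sketch. The function names (`ternary_generic`, `rule30_reduced`, `truth_table`) and the 64-bit lane width are my own assumptions for illustration; they come neither from the article nor from the linked repository. The generic version ORs together one minterm per set bit of the rule, while the reduced version is a hand-derived two-operation form for rule 30, which is the kind of result an optimal search would have to find for every rule.

```python
MASK = (1 << 64) - 1  # assumption: emulate 64-bit lanes

def ternary_generic(rule, a, b, c):
    """Apply the 8-bit truth table `rule` bitwise to a, b, c.

    Bit number (a_bit << 2 | b_bit << 1 | c_bit) of `rule` is the output,
    which is the same indexing that vpternlog uses for its immediate.
    """
    out = 0
    for idx in range(8):
        if (rule >> idx) & 1:
            # Build the minterm that is 1 exactly where (a, b, c) == idx.
            ta = a if idx & 4 else ~a
            tb = b if idx & 2 else ~b
            tc = c if idx & 1 else ~c
            out |= ta & tb & tc
    return out & MASK

def rule30_reduced(a, b, c):
    """Hand-reduced form of rule 30: a XOR (b OR c), two operations."""
    return (a ^ (b | c)) & MASK

def truth_table(f):
    """Recover the 8-bit rule of a three-input bitwise function f."""
    # 0xF0, 0xCC, 0xAA enumerate all eight input combinations at once.
    return f(0xF0, 0xCC, 0xAA) & 0xFF
```

For example, `truth_table(rule30_reduced)` evaluates to 30, and the two implementations agree on arbitrary inputs. The generic version costs up to dozens of operations per rule, which is why a per-rule reduced sequence matters.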