by tomn
459 days ago
I'm not sure how "it can go the other way around too" -- in that case (assigning to a uint8_t local variable), it seems like that particular optimisation is just not being applied. Interestingly, if the local variable is "volatile uint8_t", the optimisation is applied. Perhaps with a uint8_t local variable and a size_t return value, an earlier optimisation removes the cast to uint8_t, because it only has an effect when undefined behaviour has been triggered? It would certainly be interesting to investigate further. In general I agree that being more explicit is better if you really care about performance. It would be great if languages provided more ways to specify this kind of thing. I tried using __builtin_expect to trigger this optimisation too, but no dice. Anyway, thanks for the interesting article.
So the case that you described has 2 layers: the internal std::count_if() layer, which has a 64-bit counter, and the 'return' layer of the count_even_values_v1() function, which has an 8-bit type. In this case, Clang propagates the 8-bit type from the 'return' layer all the way to the inner std::count_if() layer, which effectively means that you're requesting an 8-bit counter, and thus Clang generates the efficient vectorized code.
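For reference, a minimal sketch of that 2-layer shape. The parameter type and the "even" predicate are assumptions, since only the function name appears in this thread:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Assumed signature: the thread names count_even_values_v1() but not its parameters.
std::uint8_t count_even_values_v1(const std::vector<std::uint8_t>& vec) {
    // Inner layer: std::count_if counts in the iterator's difference_type
    // (64-bit on typical 64-bit targets).
    // Return layer: the result is narrowed to 8 bits, i.e. taken modulo 256.
    return static_cast<std::uint8_t>(
        std::count_if(vec.begin(), vec.end(),
                      [](std::uint8_t x) { return x % 2 == 0; }));
}
```

Whether the compiler actually shrinks the counter to 8 bits here depends on the compiler version and flags, so it is worth checking the generated assembly rather than assuming it.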
However, say that you have the following 3 layers: (1) the internal std::count_if() layer with a 64-bit counter; (2) a local 8-bit variable layer, to which the std::count_if() result gets assigned; (3) the 'return' layer with a 64-bit type. In this case the 64-bit type from layer 3 gets propagated to the inner std::count_if() layer, which leads to poor vectorization. Demo: https://godbolt.org/z/Eo13WKrK4. So this downwards type propagation from the outermost layer into the innermost layer doesn't guarantee optimality. In this case, the optimal propagation would've been from layer 2 down to layer 1 and up to layer 3.
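Roughly, the 3-layer shape looks like the sketch below. The function name, parameter type, and predicate are hypothetical and not necessarily identical to the code in the demo link:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical name and signature, for illustration only.
std::size_t count_even_values_three_layers(const std::vector<std::uint8_t>& vec) {
    // Layer 1: std::count_if counts in a 64-bit difference_type (on typical targets).
    // Layer 2: the result is narrowed into an 8-bit local, i.e. taken modulo 256.
    const std::uint8_t count = static_cast<std::uint8_t>(
        std::count_if(vec.begin(), vec.end(),
                      [](std::uint8_t x) { return x % 2 == 0; }));
    // Layer 3: the 8-bit value is widened back to the 64-bit return type.
    return count;
}
```

Semantically the result is still the count modulo 256, so an 8-bit counter would be valid here too; the point is only that the compiler may not take advantage of it.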
Note: I'm not familiar with how exactly the LLVM optimization pass does this, so take this with a huge grain of salt. Perhaps it does indeed 'propagate' the outermost type to the innermost layer. Or perhaps the mere fact that there are more than 2 layers prevents the optimization pass from running at all. Either way, the end result is that the vectorization is poor.