| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by tomn 507 days ago

I'm not sure how "it can go the other way around too" -- in that case (assigning to a uint8_t local variable), it seems like that particular optimisation is just not being applied.

Interestingly, if the local variable is "volatile uint8_t", the optimisation is applied. Perhaps with an uint8_t local variable and size_t return value, an earlier optimisation removes the cast to uint8_t, because it only has an effect when undefined behaviour has been triggered? It would certainly be interesting to investigate further.

In general I agree that being more explicit is better if you really care about performance. It would be great if languages provided more ways to specify this kind of thing. I tried using __builtin_expect to trigger this optimisation too, but no dice.

Anyway, thanks for the interesting article.

2 comments

nicula 507 days ago

> I'm not sure how "it can go the other way around too" -- in that case (assigning to a uint8_t local variable), it seems like that particular optimisation is just not being applied.

So the case that you described has 2 layers. The internal std::count_if() layer, which has a 64-bit counter, and the 'return' layer of the count_even_values_v1() function, which has an 8-bit type. In this case, Clang propagates the 8-bit type from the 'return' layer all the way to the inner std::count_if() layer, which effectively means that you're requesting an 8-bit counter, and thus Clang generates the efficient vectorization.

However, say that you have the following 3 layers: (1) internal std::count_if() layer with a 64-bit counter; (2) local 8-bit variable layer, to which the std::count_if() result gets assigned; (3) 'return' layer with a 64-bit type. In this case the 64-bit type from layer 3 gets propagated to the inner std::count_if() layer, which will lead to a poor vectorization. Demo: https://godbolt.org/z/Eo13WKrK4 . So this downwards type-propagation from the outmost layer into the innermost layer doesn't guarantee optimality. In this case, the optimal propagation would've been from layer 2 down to layer 1 and up to layer 3.

Note: I'm not familiar with how the LLVM optimization pass does this exactly, so take this with a huge grain of salt. Perhaps it does indeed 'propagate' the outmost type to the innermost layer. Or perhaps the mere fact that there are more than 2 layers makes the optimization pass not happen at all. Either way, the end result is that the vectorization is poor.

link

tomn 507 days ago

I've had a look at what's going on in LLVM, and we're both a bit wrong :)

This optimisation is applied by AggressiveInstCombinePass, after the function has been completely inlined. In cases where it is applied, the i64 result of the count is truncated to i8, and this gets propagated to the counter.

In the case where the result is assigned to a local variable, an earlier pass (before inlining) turns a truncate (for the cast) followed by a zero extend (for the return) into an and with 0xff. This persists, and AggressiveInstCombinePass then doesn't propagate this to the counter.

I've posted some selected bits of LLVM IR here:

https://gist.github.com/tomjnixon/d205a56ffc18af499418965ab7...

These come from running clang with "-mllvm=-print-after-all" and grepping for "^define.*_Z20count_even_values_v1RKSt6vectorIhSaIhEE"

This is why i don't see this as an optimisation pass "backfiring" or "go[ing] the other way around" (well, except for the "trunc,extend->and" one which we weren't talking about). Rather, it's just an optimisation not being applied. That might just be a language thing.

link

nicula 507 days ago

Thanks for looking into it!

I modified the footnote to get rid of the misleading statements regarding the 'backfiring' of the optimization. :)

link

fc417fc802 507 days ago

Wouldn't that violate the as-if rule? If you assign to a u8 in layer 2 then the compiler must truncate regardless of the widening of the value upon return. It can't just ignore the narrowing assignment.

link

dzaima 506 days ago

At the very end there's a "movzx eax, dl", i.e. zero-extend the low 8 bits of the accumulated value.

link

tomn 506 days ago

Just to correct my own comment:

> with an uint8_t local variable and size_t return value, an earlier optimisation removes the cast to uint8_t, because it only has an effect when undefined behaviour has been triggered

In this case, there is no undefined behaviour, because a narrowing cast to an unsigned type is well-defined. So, this could never have been a good explanation.

link