Why wouldn't you make key volatile? Shouldn't that solve all the problems? Or is it because it would be too slow, since the compiler can no longer do as many optimizations in the rest of the function?
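For reference, a minimal sketch of what declaring the key itself volatile might look like. The names `session_key`, `KEY_LEN`, `load_key`, `wipe_key`, and `key_is_zero` are made up for illustration and aren't from the article. Since the object is volatile-qualified, every byte read and write has to be performed, and the key can't be handed to `memcpy` or `memset` without casting the qualifier away, hence the hand-written loops.

```c
#include <stddef.h>

#define KEY_LEN 32  /* hypothetical key size */

/* The storage itself is volatile-qualified, so every access to it is an
   observable side effect that the compiler has to perform. */
static volatile unsigned char session_key[KEY_LEN];

/* Copy key material in one byte at a time; memcpy() takes a plain void *,
   so passing session_key to it would discard the qualifier. */
void load_key(const unsigned char *src)
{
    for (size_t i = 0; i < KEY_LEN; i++)
        session_key[i] = src[i];
}

/* Wipe the key; these stores target a volatile object and must be emitted. */
void wipe_key(void)
{
    for (size_t i = 0; i < KEY_LEN; i++)
        session_key[i] = 0;
}

/* Return 1 if every key byte is zero, 0 otherwise. */
int key_is_zero(void)
{
    unsigned char acc = 0;
    for (size_t i = 0; i < KEY_LEN; i++)
        acc |= session_key[i];
    return acc == 0;
}
```

The cost the question worries about shows up in any code that reads `session_key`: each use is a separate load that can't be cached in a register or vectorized.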
Can you play the game the other way and "fail safe"?
I.e., declare the storage volatile but run your crypto code on a non-volatile pointer to it (obtained via a cast) to get your performance back?
If the compiler then gets smart enough to work out that the non-volatile pointer you've passed into your crypto code refers to volatile storage, you keep the security but take a performance hit (one you'd hopefully notice in testing).
I guess that's not as good as your solution though.
That's discussed in the article. Volatile ultimately applies to the storage, so a sufficiently smart compiler may be able to deduce that you're lying to it with the cast and elide the write.
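For anyone following along, here is a rough sketch of the cast being talked about: ordinary, non-volatile storage that gets wiped through a volatile-qualified pointer. `secure_zero` is a hypothetical name, not something from the article. The stores go through a volatile lvalue, but the underlying object isn't volatile, which is exactly the loophole described above. Compilers tend to emit the stores today, but as far as I can tell the standard doesn't strictly promise it.

```c
#include <stddef.h>

/* Hypothetical helper: zero `len` bytes at `buf` through a
   volatile-qualified pointer. The buffer itself is NOT volatile;
   only the lvalue used for the stores is. */
void secure_zero(void *buf, size_t len)
{
    volatile unsigned char *p = (volatile unsigned char *)buf;

    while (len--)
        *p++ = 0;
}
```

Typical use would be `secure_zero(key, sizeof key);` just before the key goes out of scope. Where the platform offers them, `memset_s` (C11 Annex K, optional) or `explicit_bzero` are designed for this job and are probably a better choice than rolling your own.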