It would be really nice if he posted the difference with/without the optimisation for context. I hope it's going to be included in the explanation post he's planning.
It looks like the code generator is only available for x86 anyway, so it seems niche in that respect. I'm all about the baseline having good performance, not just the special case.