This. Optimizing performance on big-endian (BE) is a waste of everyone's time and resources.
(But the status quo on BE is a load followed by a byte swap, which is probably pretty cheap anyway. The compiler might even already know how to fold that pair into a single byte-reversed load instruction.)
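
For anyone who wants to check what their compiler does with this, here is a rough sketch of the two usual ways to write a little-endian 32-bit load in C. The names `load_le32` and `load_le32_swap` are made up for illustration and are not from any particular codebase. The second variant assumes GCC or Clang (`__builtin_bswap32` and the `__BYTE_ORDER__` macro); if that macro is missing, it assumes a little-endian host.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: assemble a little-endian 32-bit value byte by byte.
 * The result does not depend on host byte order. Optimizing compilers
 * often recognize this pattern and emit a single load (plus a swap, or a
 * byte-reversed load, on big-endian targets), but that is not guaranteed. */
static inline uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: the explicit "load, then byte swap" form.
 * memcpy avoids alignment and aliasing problems; the swap is only
 * compiled in when the host is big-endian (GCC/Clang macros assumed). */
static inline uint32_t load_le32_swap(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    v = __builtin_bswap32(v);
#endif
    return v;
}
```

Both functions should return the same value on any host. Whether the swap gets folded into a byte-reversed load (for example `lwbrx` on PowerPC) depends on the compiler, target, and optimization level, so it is worth looking at the generated assembly rather than assuming.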