I wonder whether the software running those models has finally been corrected[1] to use floating point properly. Getting different results from the same data and software on different hardware suggests rounding wasn't being handled properly[2].
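One well-known way the same data and software can give different answers on different machines is a change in the order of floating-point operations (for example, a vectorized or multi-threaded sum that groups terms differently). Because each IEEE 754 addition rounds its result, addition is not associative. This is a minimal illustration with three arbitrary constants, not code from any weather model:

```python
# Floating-point addition is not associative: the same three numbers,
# grouped differently, round to different doubles.
a, b, c = 0.1, 0.2, 0.3

left_first = (a + b) + c    # rounds 0.1 + 0.2 first
right_first = a + (b + c)   # 0.2 + 0.3 happens to be exactly 0.5

print(left_first, right_first, left_first == right_first)
# prints: 0.6000000000000001 0.6 False
```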
These models are already so sensitive to initial conditions that "ensemble prediction" is necessary to cope with their chaotic behavior. A rounding error acts like a tiny perturbation of those conditions, so mishandling floating-point rounding could completely destroy the results.
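To see how a chaotic system amplifies rounding, here is a sketch using the logistic map x → r·x·(1−x) with r = 4. This is a standard textbook chaotic system chosen as an assumption for illustration; it is not a weather model. The two step functions are algebraically identical and differ only in how they round. Starting from the same x0, their trajectories separate from a difference of roughly one unit in the last place to a difference of order 1:

```python
# Hypothetical toy model (logistic map, r = 4), used only to show how
# chaos amplifies last-bit rounding differences. Not a weather model.
R = 4.0
STEPS = 200


def step_a(x: float) -> float:
    # r * x * (1 - x): rounds (1 - x), then the products.
    return R * x * (1.0 - x)


def step_b(x: float) -> float:
    # r*x - r*x*x: mathematically the same map, but rounds x*x instead.
    return R * x - R * x * x


def max_divergence(x0: float, steps: int = STEPS) -> float:
    """Largest |xa - xb| seen while iterating both formulations from x0."""
    xa = xb = x0
    worst = 0.0
    for _ in range(steps):
        xa, xb = step_a(xa), step_b(xb)
        worst = max(worst, abs(xa - xb))
    return worst


# After one step the two versions agree to within a few ulps...
print(abs(step_a(0.1) - step_b(0.1)))
# ...but within 200 steps they disagree at the first significant digit.
print(max_divergence(0.1))
```

With r = 4 the map roughly doubles small separations on each step, so a difference near 1e-17 needs only about 55 steps to reach order 1. That is the same mechanism that makes ensembles necessary in the first place.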