It's worse on English and a lot of other common languages (see Appendix C of the paper). It does better on less common languages like Latvian or Tajik, though.
Which implies Whisper just hasn't focused on those languages? Seems disingenuous to claim that the error rate has halved when it's worse in the apex language.
Edit: Just checked the paper, it seems to be worse[1][2] but feel free to correct me.
I feel like they should've just taken the Whisper architecture, scaled it, and scaled the dataset as they did.
[1] Page: https://i.imgur.com/bq15Tno.png
[2] Paper: https://scontent.fcai19-5.fna.fbcdn.net/v/t39.8562-6/3488279...