On most of their tests gpt-4 is not actually worse [1]. In particular coding results are affected by changed due to different output format rather than worse abilities [2]. But that's ok because the message of the paper is that there is strong drift between versions and developers should be aware of it, not that gpt becomes worse [3].