Not really. They already chose to show the benchmark where it does best and even then it’s still quite a bit worse (though definitely impressive for its size).
If you look at other benchmarks, for example 5-shot MMLU, this scores 46.3 while GPT-3.5 scores 70.
But there might be some use cases where this one is close enough in performance and the difference in cost and speed makes it a better choice.
Benchmarks are either limited, or have data leaks, or in most cases just don't make sense in terms of usability, so I've personally stopped looking at them to compare models. If I want to try a new model I hear a lot of chatter about, I use it for a few hours in my daily workflow. My baselines are GPT-3.5 and GPT-4, and I compare new models against them in my day-to-day usage.
The LLM field is still messy at large. If you look at the rankings of model performance, they still do not reflect usability in real life. I think one major challenge is finding a benchmark that does.