Are there standardized ways of objectively testing LLMs yet? The Chatbot Arena thing has always felt weird to me; it's basically ranking them based on vibes.
Not really. There's a hundred benchmarks, but all of them suffer from the same issues. They're rated by other LLMs, and the tasks are often too simple and similar to each other. The hope is that just gathering enough of these benchmarks means you get a representative test suite, but in my view we're still pretty far off.
One thing is sure: the commonly used benchmarks are mostly polluted and worthless, so you have to go to niche ones.
For example, the one I check for coding is the Aider LLM leaderboard [1].
We maintain the Kagi LLM Benchmarking Project [2], which is optimized for the use case of LLMs in search.
[1] https://aider.chat/docs/leaderboards/
[2] https://help.kagi.com/kagi/ai/llm-benchmark.html