Hacker News
by squigz 667 days ago
Are there standardized ways of objectively testing LLMs yet? The Chatbot Arena thing has always felt weird to me; it's basically ranking them based on vibes.
2 comments

Short answer is no, because there is no 'standardized' use case.

One thing is sure: the commonly used benchmarks are mostly polluted and worthless, so you have to go to niche ones.

For example, the one I check for coding is the Aider LLM leaderboard [1].

We maintain the Kagi LLM Benchmarking Project [2], which is optimized for the use case of using LLMs in search.

[1] https://aider.chat/docs/leaderboards/

[2] https://help.kagi.com/kagi/ai/llm-benchmark.html

Not really. There are a hundred benchmarks, but they all suffer from the same issues: they're rated by other LLMs, and the tasks are often too simple and too similar to each other. The hope is that gathering enough of these benchmarks gives you a representative test suite, but in my view we're still pretty far off.