HN Mirror

by Workaccount2 280 days ago

The best benchmark is the community vibe in the weeks following a release.

Claude benchmarks poorly but vibes well. Gemini benchmarks well and vibes well. Grok benchmarks well but vibes poorly.

(yes I know you are gushing with anecdotes, the vibes are simply the approximate color of gray born from the countless black and white remarks.)

2 comments

diggan 280 days ago

> The best benchmark is the community vibe in the weeks following a release.

True, just be careful which community you use as a vibe check. Most of the big mainstream ones around AI and LLMs basically have influence campaigns run against them and are made up of giant hive-minds that all think alike. You need to carefully assess whether anything you're reading is true, and votes tend to make it even worse.

theblazehen 279 days ago

I generally check LM Arena as well as which models have had the most weekly tokens on openrouter

wubrr 280 days ago

the vibes are just a collection of anecdotes

ryoshu 280 days ago

"qual"
