Hacker News
by
amelius
466 days ago
Aren't there any "blind" benchmarks?
2 comments
nathanasmith
466 days ago
Unfortunately, that wouldn't help as much as you think, since talented AI labs can just watch the public leaderboard, note which models move up and down, and use that to deduce and target whatever the hidden benchmark is testing.
nickthegreek
466 days ago
OpenRouter Arena Ratings are probably the closest thing.