| HN Mirror



	by aredox 458 days ago
	The fact that those big LLM developers devote a significant amount of effort to game benchmarks is a big show of confidence that they are making progress towards AGI and will recoup those billions of dollars and man-hours /s

2 comments

amelius 458 days ago

Are the benchmark prompts public and isn't that where the problem lies?


StevenWaterman 458 days ago

No, even if the benchmarks are private, it's still an issue, because you can overfit to the benchmark by trying X random variations of the model and picking the one that performs best on the benchmark.

It's similar to how I can pass any multiple-choice exam if you let me keep attempting it and tell me my overall score at the end of each attempt, even if you don't tell me which answers were right or wrong.
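
The best-of-X selection effect can be illustrated with a toy simulation. This is only a sketch under stated assumptions: each "model variant" is pure random guessing with no real skill, and the function name `best_of_x`, the question counts, and the held-out set are all hypothetical, not taken from any real benchmark.

```python
import random


def best_of_x(num_variants, num_questions=40, num_choices=4, seed=0):
    """Pick the variant with the best private-benchmark score.

    Returns (benchmark accuracy of the winner, its accuracy on a fresh
    held-out set). Every variant is a random guesser (an assumption made
    purely for illustration), so any score above chance is selection luck.
    """
    rng = random.Random(seed)
    benchmark_key = [rng.randrange(num_choices) for _ in range(num_questions)]
    holdout_key = [rng.randrange(num_choices) for _ in range(num_questions)]

    def accuracy(answers, key):
        return sum(a == k for a, k in zip(answers, key)) / len(key)

    best_benchmark, best_holdout = -1.0, 0.0
    for _ in range(num_variants):
        bench_answers = [rng.randrange(num_choices) for _ in range(num_questions)]
        holdout_answers = [rng.randrange(num_choices) for _ in range(num_questions)]
        bench_score = accuracy(bench_answers, benchmark_key)
        # Selection uses only the benchmark score, never the held-out set.
        if bench_score > best_benchmark:
            best_benchmark = bench_score
            best_holdout = accuracy(holdout_answers, holdout_key)
    return best_benchmark, best_holdout
```

The winner's benchmark score can only go up as more variants are tried, while its held-out score has no reason to follow, since nothing was selected on it. The benchmark never has to be public for this to happen.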


VladVladikoff 458 days ago

Now I’m wondering what the most efficient algorithm is to obtain a mark of 100% in the fewest attempts. Guessing one question per attempt seems inefficient. Perhaps submitting the whole exam as option A, then the whole exam as option B, and so on, at the start, could give you a count of how many answers of each option are correct. Then maybe some sort of binary search through the rest of the options? You could submit the first 1/2 as A and the second 1/2 as B, etc. Hmmmm.
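
One way to sketch the count-then-split idea is below. It is not claimed to be the most efficient algorithm, and it leans on an assumption the thread does not state: a question may be left blank and a blank scores zero. The `Exam` class and `solve` function are hypothetical names made up for this illustration.

```python
BLANK = None  # assumption: unanswered questions are allowed and score zero


class Exam:
    """Hypothetical grader that reveals only the total score per attempt."""

    def __init__(self, key):
        self._key = list(key)
        self.attempts = 0

    def submit(self, answers):
        self.attempts += 1
        return sum(a == k for a, k in zip(answers, self._key))


def solve(exam, num_questions, num_choices):
    """Recover every answer using only total-score feedback."""
    solved = [BLANK] * num_questions

    def count(questions, option):
        # Answer only `questions` with `option`; the score is the hit count.
        sheet = [BLANK] * num_questions
        for q in questions:
            sheet[q] = option
        return exam.submit(sheet)

    def locate(questions, option, hits):
        # Exactly `hits` of `questions` have `option` as the right answer.
        if hits == 0:
            return
        if hits == len(questions):
            for q in questions:
                solved[q] = option
            return
        mid = len(questions) // 2
        left, right = questions[:mid], questions[mid:]
        left_hits = count(left, option)  # the right half follows by subtraction
        locate(left, option, left_hits)
        locate(right, option, hits - left_hits)

    remaining = list(range(num_questions))
    for option in range(num_choices - 1):
        if not remaining:
            break
        locate(remaining, option, count(remaining, option))
        remaining = [q for q in remaining if solved[q] is BLANK]
    # Anything still unsolved must be the last option, so no query is needed.
    for q in remaining:
        solved[q] = num_choices - 1
    return solved
```

Each split costs one attempt and yields the counts for both halves, so this never needs more attempts than trying every option on every question one at a time, and halves with zero hits or all hits are resolved without further queries. Submitting the returned sheet once more would then score 100%.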


amelius 458 days ago

Maybe an LLM can tell you how best to approach this problem ;)


amelius 458 days ago

Maybe there should be some rate limiting on it then? I.e., once a month you can benchmark your model. Of course you can submit under different names, but how many company names can someone realistically come up with and register?


sebastiennight 458 days ago

So now you want OpenAI to go even wilder in how they name each new model?


amelius 458 days ago

1 model per company per month, max.


leto_ii 458 days ago

Is this sarcasm? Otherwise I'm not sure how that follows. Seems more reasonable to believe that they're hitting walls and switching to PR and productizing.


Terr_ 458 days ago

I believe they are being sarcastic, but Poe's Law is in play and it's too ambiguous for practical purposes.


RodgerTheGreat 458 days ago

Ending a paragraph with "/s" is a moderately common convention for conveying a sarcastic tone through text.
