| HN Mirror

by jo909 600 days ago

> Couldn't LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?

They could. But they would easily be found out once they lose out in real-world usage or on new, improved benchmarks.

If you were in charge of a large and well funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?

I would exclude them as well as possible, so I get feedback on how "real" any model improvement is. I need to deliver real-world improvements in the end, and any short-term gain in usage from cheating on benchmarks seems very foolish.
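
For what it's worth, "excluding them" could be sketched roughly as an n-gram overlap filter over the training corpus. This is only a minimal illustration under my own assumptions, not how any provider actually does it: the function names, the 8-word window, and the sample strings are all made up for the example.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in `text` (hypothetical helper)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_benchmark_index(benchmark_texts, n=8):
    """Collect every n-gram that appears in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(doc, index, n=8):
    """A document is flagged if it shares at least one n-gram with the benchmark."""
    return not ngrams(doc, n).isdisjoint(index)


def filter_training_docs(docs, benchmark_texts, n=8):
    """Keep only the documents that share no n-gram with the benchmark."""
    index = build_benchmark_index(benchmark_texts, n)
    return [d for d in docs if not is_contaminated(d, index, n)]


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    benchmark = ["Write a function that returns the sum of all even numbers in a list"]
    corpus = [
        "Solution dump: write a function that returns the sum of all even numbers in a list. def f(xs): ...",
        "How to bake sourdough bread at home with a simple starter and some patience",
    ]
    for doc in filter_training_docs(corpus, benchmark):
        print(doc)
```

Exact n-gram matching like this only catches verbatim copies. Paraphrased or translated benchmark items would slip through, which is part of why doing this well at scale takes real effort.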

5 comments

gloosx 600 days ago

It sounds very nice, but at the same time very naive, sorry. Funding is not a gift, and they must make money. The more funding they get - the more pressure there is to make money.

When you're in charge of a billion-dollar valuation company which is expected to remain unprofitable by 2029, it's hard to find a topic more crucial and intriguing than growth and making more money.

And yes, it is a recurring theme for vendors to tune their products specifically for industry-standard benchmarks. I can't find any specific reason for them not to pay people to train their model to score 90% on these 113 Python tasks, as it directly drives profits up, whereas not doing it brings absolutely nothing to the table. Surely they have their own internal benchmarks, which they can exclude from training data.

youoy 600 days ago

> If you were in charge of a large and well funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?

You should already know by now that economic incentives are not always aligned with science/knowledge...

This is the true alignment problem, not the AI alignment one hahaha

concordDance 600 days ago

The AI alignment problem and the people alignment problem are actually the same problem! :D

One is just a bit harder due to the less familiar mind "design".

carschno 600 days ago

They cannot be found out as long as there is no better evaluation. Sure, they would be if they produced obvious nonsense, but the point of a systematic evaluation is exactly to replace subjective impressions based on individual examples as the measure of quality.

Also, you are right that excluding test data from the training data improves your model. However, given the insane amounts of training data, this requires significant effort. If that additionally leads to your model performing worse on existing leaderboards, I doubt that (commercial) organizations would pay for such an effort.

And again, as long as there is no better evaluation method, you still won't know how much it really helps.

KeplerBoy 600 days ago

This market is all about hype and mindshare. Proper testing is hard and not performed by individuals, so there is no incentive not to train a bit on the test set.

gershy 600 days ago

And if there is a board that will fire you if expected profits do not increase, do you still maintain this stance?
