|
|
|
|
|
by jo909
600 days ago
|
|
> Couldn't LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value? They could. They would easily be found out as they loose in real world usage or improved new unique benchmarks. If you were in charge of a large and well funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data? I would exclude them as well as possible so I get feedback on how "real" any model improvement is. I need to develop real world improvements in the end, and any short term gain in usage by cheating in benchmarks seems very foolish. |
|
When you're in charge of a billion-dollar valuation company which is expected to remain unprofitable by 2029, it's hard to find a topic more crucial and intriguing than growth and making more money.
And yes, it is a recurring theme for vendors to tune their products specifically for industry-standard benchmarks. I can't find any specific reason for them not to pay people for training their model to score 90% on these 113 python tasks, as it directly drives profits up, whereas not doing it will bring absolute nothing to the table - surely they have their own internal benchmarks which they can exclude from training data.