Hacker News
by pclmulqdq 1068 days ago
Nobody has any evidence here. I'm saying that the incentives are such that the null hypothesis should be the opposite of what you think.

Your entire argument, your incentives, hinge on "OpenAI's main product is 'LLM that benchmarks the best'", which is a particularly silly assertion when OpenAI did not release benchmark evaluations for 3.5 for months. Not when the product was released. Not even when the API was released.
You don't have to release official numbers to run benchmarks. You also don't have to own the LLM to run benchmarks. Within hours of GPT-4's emergence, many benchmarks had been run.
You said their main product was "LLMs that benchmark the best", as if benchmarking were some important aspect of marketing. It's not. That's a fact. You can't say it's this hugely important thing and conveniently leave out that they make near-zero effort to do anything with it.

Basically the only people running benchmarks that could have been gamed on GPT-4 were other researchers, not companies, customers or users looking to use a product.

Normal users are certainly not running benchmarks, and companies that do run benchmarks run them on internal data, which just defeats the whole point of gaming these research benchmarks.